Introduction

Wine is one of the oldest inventions still enjoyed around the world today. While the process used to make it has changed over time with new technologies and automation, the basic steps remain the same. When most people think of the origins of wine, they probably think of ancient Greece or Rome. However, the earliest evidence dates back to 7,000 BC in China, where people enjoyed fermented grape-based drinks, and the first true wine was drunk in Georgia around 6,000 BC. It was not until much later that the Greeks and Romans began their wine-making legacies. Without the widespread trade routes created by the Phoenician civilization, the Greeks and Romans might never have encountered wine at all.

While the Romans may not have invented wine, they certainly took great steps toward perfecting it. They were the first to use barrels and glass bottles in the wine-making and storing process, and many of their techniques still serve as the fundamentals of wine-making today. When the Roman Empire fell, their process was preserved by the Catholic Church until it spread throughout the modern world. When wine-making became popular in America, American wine was viewed as being of much lower quality until the late 20th century. In the last 50 years, vineyards have been opening and operating throughout the United States, with the most activity on the east and west coasts.

There are five basic steps in the wine-making process.

  1. The grapes are picked
  2. The grapes are crushed
  3. The grape juice is fermented into wine
  4. The wine is aged in barrels
  5. The wine is bottled for sale

It may be obvious that the type of grape used in the wine-making process influences the quality and flavor, but the specific process used influences them as well. For example, the sugar levels in the grapes can change depending on whether they are harvested during the day or at night. While the goal is to completely filter the juice before fermentation, any particles left behind can influence the flavor too. Fermentation requires the addition of yeast, and there are many different yeast strains and recipes that can change the final wine product. There are also many different options for aging the wine: it can be aged for different lengths of time, in different types of barrels, and with varying amounts of char on the barrel wood.

With all of these different steps and types of grapes available, it is easy to see how so many different types of wines are created. All of those types have varying degrees of quality and different flavors as well. It is a fairly common experience to try two different wines of the same type, such as Chardonnay, and really like one but not like the other at all. But why does this occur? If a person enjoys the taste of Chardonnay, why don’t they like every bottle of Chardonnay? It all boils down to the basic chemical properties that make up each wine. Even with strict controls, these will vary from bottle to bottle and vineyard to vineyard. Many vineyards will experiment with the values of the chemical properties to try to create new flavors and improved quality. Some properties have a tendency to influence the flavor of wine more than others, but can those properties be used to predict the quality of a finished product?

(reference: https://storymaps.arcgis.com/stories/c5a01856223745d19ee5a2f640624b83)

(reference: https://winefolly.com/deep-dive/how-wine-is-made-in-pictures/)

Analysis and Models

The purpose of this investigation is to understand how the physicochemical properties of white wine influence its quality and then to predict the quality of white wine based on the values of those properties. This section features an exploration of the data set along with visualizations of all of the variables. After the data set has been cleaned and prepared for analysis, a variety of supervised and unsupervised machine learning techniques will be applied to the data.

About the Data

Before discussing the data set, the following packages are loaded into R to be used for visualizations and analyses.

#Load in all the libraries.
library(tidyverse)
library(e1071)
library(randomForest)
library(randomForestExplainer)
library(factoextra)
library(cluster)
library(caret)
library(kernlab)
library(rpart.plot)
library(rpart)
library(rattle)
library(party)
library(FactoMineR)
library(Rtsne)
library(plotly)
library(class)
library(naivebayes)
library(knitr)
library(GGally)
library(classInt)
library(arules)
library(arulesViz)
library(reshape2)
library(BBmisc)
library(dendextend)
library(proxy)
library(tm)
library(slam)

The data set is based on the results of quality tests of the Portuguese Vinho Verde white wine. This wine comes from a small region in Northern Portugal and is known for its great value and popularity as a “summer wine”. The input variables are the values of the various physicochemical properties of the wine as measured by precise scientific tests. The output variable is the quality of the wine rated on a scale from 0 to 10 by a panel of wine experts. The final quality score is the median of at least three different expert scores.
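As a small illustration of how the final rating is derived, the median of a hypothetical set of panel scores can be computed directly in R (the scores below are invented, not from the data set):

```r
#Hypothetical scores from a three-expert panel (not from the data set).
expert_scores <- c(5, 6, 7)

#The final quality rating is the median of the individual scores.
median(expert_scores)
## [1] 6
```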

(reference: https://winefolly.com/deep-dive/vinho-verde-the-perfect-poolside-wine-from-portugal/)

The white wine quality data set is stored in the UCI Machine Learning Repository and was contributed by Paulo Cortez of the University of Minho in Portugal in 2009.

(citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.)

The data set is stored as a semicolon-delimited file. This means it must be read into R using the read_delim function, setting the delim argument to a semicolon.

#Read in the data set.
wine <- read_delim("~/Documents/Syracuse/Quarter 2/Applied Machine Learning/winequality-white.csv", delim = ";")
## Rows: 4898 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Show the structure of the data set.
str(wine)
## spec_tbl_df [4,898 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ fixed acidity       : num [1:4898] 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile acidity    : num [1:4898] 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric acid         : num [1:4898] 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual sugar      : num [1:4898] 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num [1:4898] 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free sulfur dioxide : num [1:4898] 45 14 30 47 47 30 30 45 14 28 ...
##  $ total sulfur dioxide: num [1:4898] 170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num [1:4898] 1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num [1:4898] 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num [1:4898] 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num [1:4898] 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : num [1:4898] 6 6 6 6 6 6 6 6 6 6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `fixed acidity` = col_double(),
##   ..   `volatile acidity` = col_double(),
##   ..   `citric acid` = col_double(),
##   ..   `residual sugar` = col_double(),
##   ..   chlorides = col_double(),
##   ..   `free sulfur dioxide` = col_double(),
##   ..   `total sulfur dioxide` = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The data set contains 12 numeric variables. The first 11 are the input variables, which contain the values of the 11 physicochemical properties measured for each sample. The 12th is the output variable, the wine quality score.

head(wine)
## # A tibble: 6 × 12
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
##             <dbl>              <dbl>         <dbl>            <dbl>     <dbl>
## 1             7                 0.27          0.36             20.7     0.045
## 2             6.3               0.3           0.34              1.6     0.049
## 3             8.1               0.28          0.4               6.9     0.05 
## 4             7.2               0.23          0.32              8.5     0.058
## 5             7.2               0.23          0.32              8.5     0.058
## 6             8.1               0.28          0.4               6.9     0.05 
## # … with 7 more variables: `free sulfur dioxide` <dbl>,
## #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>

Viewing the first six rows gives an initial impression of the data. Each of the physicochemical property variables contains numerical data, and the wine quality rating is also presented as a number between 0 and 10.

#Number of rows and columns
dim(wine)
## [1] 4898   12

There are 4,898 rows in the data set. Each row of the data represents a unique wine sample that was taste-tested by the expert panel.

#Are there any missing values?
sum(is.na(wine))
## [1] 0
wine[rowSums(is.na(wine)) > 0,]
## # A tibble: 0 × 12
## # … with 12 variables: fixed acidity <dbl>, volatile acidity <dbl>,
## #   citric acid <dbl>, residual sugar <dbl>, chlorides <dbl>,
## #   free sulfur dioxide <dbl>, total sulfur dioxide <dbl>, density <dbl>,
## #   pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <dbl>

This data set has no missing values. Every value for each physicochemical property was measured and recorded, and there are no wine samples without a corresponding quality rating.

Before any analysis, the column names of the data set will be cleaned. Presently, the column names are multiple words with spaces. These names are very unwieldy for writing efficient R code. Therefore, each column will be renamed without including any spaces.

#Rename the columns so there are no spaces for ease of use.
wine <- wine %>%
  rename(FixedAcidity = 'fixed acidity',
         VolatileAcidity = 'volatile acidity',
         CitricAcid = 'citric acid',
         ResidSugar = 'residual sugar',
         Chlorides = 'chlorides',
         FreeSO2 = 'free sulfur dioxide',
         TotalSO2 = 'total sulfur dioxide',
         Density = 'density',
         Sulfates = 'sulphates',
         PercentAlc = 'alcohol',
         WineQuality = 'quality'
  )

The following table summarizes the high-level information for each of the variables.

#Table summarizing variable information
variable_names <- c("Fixed Acidity", "Volatile Acidity", "Citric Acid", "Residual Sugar", "Chlorides", "Free Sulfur Dioxide", "Total Sulfur Dioxide", "Density", "pH", "Sulfates", "Percent Alcohol", "Wine Quality")
variable_description <- c("The measure of tartaric acid in g/L", "The measure of acetic acid in g/L", "The measure of citric acid in g/L", "The measure of sugar after fermentation in g/L", "The measure of salt in g/L", "The measure of free SO2 in mg/L", "The total measure of SO2 in mg/L", "The density of the wine in g/ml", "The pH rating of the wine on a scale of 0 to 14", "The measure of potassium sulfate in g/L", "The percent alcohol content of the wine", "The quality rating of the wine on a scale of 0 to 10")
variable_table <- data.frame(variable_names, variable_description)
kable(variable_table, col.names = c("Variable", "Description"), caption = "Description of each variable")
Description of each variable

Variable               Description
Fixed Acidity          The measure of tartaric acid in g/L
Volatile Acidity       The measure of acetic acid in g/L
Citric Acid            The measure of citric acid in g/L
Residual Sugar         The measure of sugar after fermentation in g/L
Chlorides              The measure of salt in g/L
Free Sulfur Dioxide    The measure of free SO2 in mg/L
Total Sulfur Dioxide   The total measure of SO2 in mg/L
Density                The density of the wine in g/ml
pH                     The pH rating of the wine on a scale of 0 to 14
Sulfates               The measure of potassium sulfate in g/L
Percent Alcohol        The percent alcohol content of the wine
Wine Quality           The quality rating of the wine on a scale of 0 to 10

Next, each variable will be discussed in detail and the distribution of its values in this data set will be visualized. These discussions include a description of each physicochemical property, details of how it impacts the wine, and a sense of the typical values found in wine.

Input Variable 1: Fixed Acidity

The fixed acidity variable contains the measure of the amount of tartaric acid present in the wine in grams per liter. This acid is non-volatile, meaning it does not just evaporate on its own. Tartaric acid reduces the pH value of the wine, provides a tart flavor, and has some influence on the feel and color of the wine.

(reference: https://chemicalstore.com/tartaric-acid-in-wine/#:~:text=Tartaric%20acid%20is%20used%20in,feel%2C%20and%20color%20of%20wines)

summary(wine$FixedAcidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The summary function provides a numerical synopsis of the distribution of the values for the fixed acidity variable. At minimum, there are 3.8 grams per liter of tartaric acid, and at maximum, there are 14.2 grams per liter. However, the first and third quartiles reveal that 50% of the wine samples have between 6.3 and 7.3 grams of tartaric acid per liter.

#Box plot of the distribution of fixed acidity
wine %>%
  ggplot(aes(x=FixedAcidity)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Tartaric Acid (g/L)") +
  ggtitle("Fig.1a Distribution of Measures of Tartaric Acid")

#Bar graph showing distribution of fixed acidity
wine %>%
  ggplot(aes(x=FixedAcidity)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Tartaric Acid (g/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.1b Distribution of Measures of Tartaric Acid")

Figures 1a and 1b visualize the distribution of the values of this variable in two different ways. The box plot reveals the presence of several outliers, represented as the dots on the left and right sides of the whiskers. The bar graph reveals that the greatest number of wine samples have around 6 to 7 grams of tartaric acid per liter, while only very few samples have greater than 9 grams per liter. The distribution is fairly close to normal, with only a very slight skew to the right.
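The outliers in these box plots are flagged by ggplot2's default rule: any point more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on a small hypothetical vector (not from the data set):

```r
#Hypothetical acidity values in g/L (not from the data set).
x <- c(6.3, 6.5, 6.8, 7.0, 7.3, 3.8, 14.2)

#Flag points beyond 1.5 * IQR from the first and third quartiles,
#the same whisker rule geom_boxplot uses by default.
q <- quantile(x, c(0.25, 0.75))
iqr_range <- q[2] - q[1]
x[x < q[1] - 1.5 * iqr_range | x > q[2] + 1.5 * iqr_range]
## [1]  3.8 14.2
```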

Input Variable 2: Volatile Acidity

The volatile acidity variable contains the measure of the amount of acetic acid present in the wine in grams per liter. Acetic acid is produced during or after the fermentation process due to the activity of yeast cells. Too much acetic acid can lead to a vinegar taste.

(reference: https://www.randoxfood.com/why-is-testing-for-acetic-acid-important-in-winemaking/#:~:text=Acetic%20acid%20is%20a%20two,small%20amount%20of%20acetic%20acid)

summary(wine$VolatileAcidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The levels of acetic acid tend to be much lower than those of the tartaric acid. The minimum value is 0.08 grams per liter and the maximum value is 1.1 grams per liter. Half of the wine samples have acetic acid levels between 0.21 and 0.32 grams per liter.

# Box plot of the distribution of volatile acidity
wine %>%
  ggplot(aes(x=VolatileAcidity)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Acetic Acid (g/L)") +
  ggtitle("Fig.2a Distribution of Measures of Acetic Acid")

#Bar graph showing the distribution of volatile acidity
wine %>%
  ggplot(aes(x=VolatileAcidity)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Acetic Acid (g/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.2b Distribution of Measures of Acetic Acid")

Figures 2a and 2b visualize the distribution of the values of the volatile acidity variable. The box plot reveals that there are many outliers, but all of them are values that are greater than the rest of the values in the data set. The median acetic acid value is a little less than 0.3 grams per liter. The bar graph reveals that the great majority of the values are less than 0.3 grams per liter, with far fewer wine samples containing more than 0.6 grams per liter. This distribution is strongly skewed to the right.

Input Variable 3: Citric Acid

The citric acid variable contains the measures of citric acid present in the wine in grams per liter. Citric acid increases the acidity of the wine and adds flavor and freshness, in part by removing excess iron. As the level of citric acid increases, the wine's ability to keep for a long time decreases.

(reference: https://wineserver.ucdavis.edu/industry-info/enology/methods-and-techniques/common-chemical-reagents/citric-acid#:~:text=Citric%20acid%20is%20often%20added,acid%20is%20its%20microbial%20instability)

summary(wine$CitricAcid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Overall, the values of citric acid present in the wine samples are quite low. There are samples that contain no citric acid, and the maximum amount is only 1.66 grams per liter. Half of the wine samples contain between 0.27 and 0.39 grams per liter.

#Box plot of the distribution of citric acid
wine %>%
  ggplot(aes(x=CitricAcid)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Citric Acid (g/L)") +
  ggtitle("Fig.3a Distribution of Measures of Citric Acid")

#Bar graph showing the distribution of citric acid
wine %>%
  ggplot(aes(x=CitricAcid)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Citric Acid (g/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.3b Distribution of Measures of Citric Acid")

Figures 3a and 3b show the distribution of the values of this variable. The box plot reveals an interesting trend in the data. While the majority of the wine samples have between 0 and 0.5 grams of citric acid per liter, there are outliers on both sides of the data. The maximum value is a fairly extreme outlier and is much greater than any of the other values. The bar graph also reveals that while the majority of wine samples contain about 0.3 grams per liter, there is a large secondary spike that occurs right below the value of 0.5 grams per liter. Overall, the distribution is skewed right because of the presence of the outliers in the upper half of the data.

Input Variable 4: Residual Sugar

The residual sugar variable contains the measures of sugar present in the wine in grams per liter. The residual sugar is the amount of naturally occurring grape sugars that remain in the wine after the fermentation process. Higher residual sugar levels result in sweeter wine. Typically, a dry wine will have between 0 and 4 grams of residual sugar per liter and a sweet wine will have greater than 35 grams of residual sugar per liter.

(reference: https://whicherridge.com.au/blog/what-is-residual-sugar-in-wine/#:~:text=Sweetness%20in%20wine%20is%20called,the%20sweeter%20the%20wine%20is)

summary(wine$ResidSugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The values of residual sugar have an extremely large range compared to the previous variables. The least amount of sugar is 0.6 grams per liter, resulting in a very dry wine, and the greatest amount is 65.8 grams per liter, resulting in a very sweet wine. Based on the definitions of dry and sweet wine, the median value of 5.2 reveals the data set is made up mostly of dry wines and semi-sweet wines toward the drier end of that in-between category. The following plots reveal that only one wine sample meets the strict criterion for a sweet wine, which causes the data to be very skewed to the right.
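Using the thresholds above (up to 4 g/L for dry, more than 35 g/L for sweet), samples could be bucketed with the base R cut() function; the sugar values below are hypothetical, not drawn from the data set:

```r
#Hypothetical residual sugar values in g/L (not from the data set).
sugar <- c(0.6, 1.7, 5.2, 9.9, 65.8)

#Bucket using the thresholds discussed above:
#(0, 4] dry, (4, 35] semi-sweet, (35, Inf] sweet.
#The first two values fall in "dry", the next two in "semi-sweet",
#and the last in "sweet".
cut(sugar, breaks = c(0, 4, 35, Inf),
    labels = c("dry", "semi-sweet", "sweet"))
```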

#Box plot of the distribution of residual sugar
wine %>%
  ggplot(aes(x=ResidSugar)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Residual Sugar (g/L)") +
  ggtitle("Fig.4a Distribution of Measures of Residual Sugar")

#Bar graph showing the distribution of residual sugar
wine %>%
  ggplot(aes(x=ResidSugar)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Residual Sugar (g/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.4b Distribution of Measures of Residual Sugar")

Figures 4a and 4b confirm what the summary function initially revealed. The great majority of the data is grouped towards the drier end of the wine spectrum. There is a giant cluster of very dry wines, followed by a fairly uniform group of those drier semi-sweet wines. The distribution is very strongly skewed to the right.

Input Variable 5: Chlorides

The chlorides variable measures the amount of sodium chloride, also known as salt, that is present in the wine in grams per liter. The amount of salt will affect the taste of the wine to some degree.

summary(wine$Chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The summary function reveals that there is very little salt present in wine. The minimum is just 0.009 grams per liter, and even the maximum is only 0.346 grams per liter. This makes sense based on real-world knowledge of wines, seeing as there are no “salty wines”.

#Box plot of the distribution of chlorides
wine %>%
  ggplot(aes(x=Chlorides)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Sodium Chloride (g/L)") +
  ggtitle("Fig.5a Distribution of Measures of Sodium Chloride")

#Bar graph showing the distribution of chlorides
wine %>%
  ggplot(aes(x=Chlorides)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Sodium Chloride (g/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.5b Distribution of Measures of Sodium Chloride")

Figure 5a reveals that this variable features a massive number of outliers. The entire box-and-whisker portion of the plot is contained between 0 and 0.1 grams per liter. Figure 5b confirms that there is a giant grouping of wine samples between about 0.02 and 0.07 grams per liter. The distribution is extremely skewed to the right, with some samples extending past 0.3 grams per liter.

Input Variable 6: Free Sulfur Dioxide

The free sulfur dioxide (SO2) variable contains the measures of the amount of free SO2 present in the wine in milligrams per liter. Free SO2 refers to SO2 that is not bound to any other compounds in the wine. This helps prevent microbial growth and oxidation that would cause the wine to spoil, but too much results in a bitter or metallic flavor.

(reference: https://www.extension.iastate.edu/wine/total-sulfur-dioxide-why-it-matters-too/#:~:text=In%20winemaking%2C%20the%20use%20of,wine%20from%20oxidation%20and%20spoilage)

summary(wine$FreeSO2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The summary function reveals that the free SO2 values also have a very wide range. The minimum value is 2 milligrams per liter, while the maximum value is 289 milligrams per liter. The median, however, is 34 milligrams per liter which indicates that the 289 mg/L wine sample is most likely a fairly extreme outlier.

#Box plot of the distribution of free SO2
wine %>%
  ggplot(aes(x=FreeSO2)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Free SO2 (mg/L)") +
  ggtitle("Fig.6a Distribution of Measures of Free SO2")

#Bar graph showing the distribution of free SO2
wine %>%
  ggplot(aes(x=FreeSO2)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Free SO2 (mg/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.6b Distribution of Measures of Free SO2")

Figure 6a confirms the hypothesis about the maximum value: it is an extreme outlier. Almost all of the data is contained between 0 and 100 milligrams per liter, and all of the outliers fall in the upper half of the data. Figure 6b reveals that the great majority of values occur between 0 and 50 milligrams per liter. The distribution is again skewed to the right because of the outliers in the upper half of the data.

Input Variable 7: Total Sulfur Dioxide

The total sulfur dioxide variable contains the measures of total SO2 present in the wine in milligrams per liter. The total SO2 is the sum of the amount of free SO2 and SO2 that is bound to other chemical compounds. Too much SO2 can affect the taste of the wine.

(reference: https://www.extension.iastate.edu/wine/total-sulfur-dioxide-why-it-matters-too/#:~:text=In%20winemaking%2C%20the%20use%20of,wine%20from%20oxidation%20and%20spoilage)

summary(wine$TotalSO2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Since the total SO2 level is directly related to the free SO2 level (total SO2 = free SO2 + bound SO2), it is unsurprising that these values are greater than the free SO2 values. The median value of total SO2 is 134 milligrams per liter, while the minimum is 9 milligrams per liter, and the maximum is 440 milligrams per liter.
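This relationship means the bound SO2 can be recovered by subtraction. Using the free and total SO2 values from the first three samples shown in head(wine) above:

```r
#Free and total SO2 (mg/L) for the first three samples shown earlier.
free_so2  <- c(45, 14, 30)
total_so2 <- c(170, 132, 97)

#Bound SO2 is the total minus the free portion.
total_so2 - free_so2
## [1] 125 118  67
```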

#Box plot of the distribution of total SO2
wine %>%
  ggplot(aes(x=TotalSO2)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Total SO2 (mg/L)") +
  ggtitle("Fig.7a Distribution of Measures of Total SO2")

#Bar graph showing the distribution of total SO2
wine %>%
  ggplot(aes(x=TotalSO2)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Total SO2 (mg/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.7b Distribution of Measures of Total SO2")

Although total and free SO2 are related, figures 7a and 7b reveal that the distribution of total SO2 is a bit different from the distribution of free SO2. There are now outliers on both ends of the range of data. There are also far fewer outliers, and the maximum value is closer to the rest of the data than it was for the distribution of free SO2. The bar graph visualizes how the majority of the data is between 100 and 200 milligrams per liter, and the distribution is just slightly skewed right because of the presence of a few outliers.

Input Variable 8: Density

The density variable contains the measures of density of the wine samples in grams per milliliter. Typically, grape juice is about 8 to 9% more dense than water, while alcohol is about 20% less dense than water. The balance of the two results in the overall density of wine generally being slightly less than that of water.

(reference: http://www.creativeconnoisseur.com/newsletter/files/497deafe6be1b2efc87df8ac6071e459-162.html)

summary(wine$Density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The range of the density measures of the wine samples is very narrow. Because it varies so little, this variable may not be very impactful in determining wine quality. The minimum density is 0.9871 grams per milliliter, while the maximum density is 1.039 grams per milliliter. As expected, the mean is slightly less than 1, the density of water.
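One way to quantify how narrow this spread is, relative to the other variables, is the coefficient of variation (standard deviation divided by the mean). The sketch below applies it to the five summary values of density listed above as a stand-in for the full column, so the exact number is only illustrative:

```r
#Coefficient of variation: relative spread of a variable.
cv <- function(x) sd(x) / mean(x)

#The five summary values of density from above (min, quartiles, max),
#used here only as an illustrative stand-in for the full column.
density_values <- c(0.9871, 0.9917, 0.9937, 0.9961, 1.0390)
cv(density_values)  #roughly 0.02, i.e. about a 2% relative spread
```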

#Box plot of the distribution of density
wine %>%
  ggplot(aes(x=Density)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Density (g/ml)") +
  ggtitle("Fig.8a Distribution of Measures of Density")

#Bar graph showing the distribution of density
wine %>%
  ggplot(aes(x=Density)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Density (g/ml)", y = "Count of Wine Samples") +
  ggtitle("Fig.8b Distribution of Measures of Density")

Figure 8a confirms that the majority of the data lies between 0.99 and 1.00 grams per milliliter. There are just three outliers, all in the upper half of the data, with the maximum value being a particularly extreme outlier. Since grape juice is more dense than water and alcohol is less dense, it can be inferred that these wine samples have relatively low percent alcohol measurements on average. Figure 8b reveals an approximately normal distribution with just the few outliers previously mentioned.

Input Variable 9: pH

The pH variable contains the pH values of the wine samples measured on the typical 0 to 14 logarithmic scale. On this scale, 0 is the most acidic and 14 is the most basic. Wines, in general, range from about 2.5 to 4.5, meaning they are fairly acidic beverages. The taste of the wine is affected by its pH value.

(reference: https://winefolly.com/deep-dive/understanding-acidity-in-wine/)
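Since pH is a base-10 logarithmic scale, each whole unit represents a tenfold change in hydrogen ion concentration; a quick illustration in R:

```r
#Hydrogen ion concentration in mol/L from a pH value: [H+] = 10^(-pH).
h_concentration <- function(ph) 10^(-ph)

#A wine at pH 3 is ten times more acidic than one at pH 4.
h_concentration(3) / h_concentration(4)
## [1] 10
```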

summary(wine$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The summary function reveals that these white wine samples are on the more acidic side of the typical wine pH range. The minimum value is 2.72 while the maximum value is just 3.82. Also, half of the wine samples have a pH between 3.09 and 3.28.

#Box plot of the distribution of pH
wine %>%
  ggplot(aes(x=pH)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of pH") +
  ggtitle("Fig.9a Distribution of Measures of pH")

#Bar graph showing the distribution of pH
wine %>%
  ggplot(aes(x=pH)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of pH", y = "Count of Wine Samples") +
  ggtitle("Fig.9b Distribution of Measures of pH")

There are outliers on both sides of the data, as shown in figure 9a. The great majority of the data is between 3.0 and 3.3. Figure 9b reveals a distribution that is very close to normal. Even though there are outliers, they are balanced on the upper and lower ends of the data, allowing the distribution to retain its normal bell curve shape.

Input Variable 10: Sulfates

The sulfates variable contains the measurements of the amount of potassium sulfate present in the wine in grams per liter. Potassium sulfate is a wine additive that serves as an antimicrobial and antioxidant agent. It is added only to help the wine stay fresh for longer and is not intended to directly influence the taste, although the amount added can still affect how the wine tastes.

summary(wine$Sulfates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The summary function reveals that white wines contain low levels of potassium sulfate. The minimum value is 0.22 grams per liter, while the maximum value is just 1.08 grams per liter. Half of the wine samples contain between 0.41 and 0.55 grams of potassium sulfate per liter.

#Box plot of the distribution of sulfates
wine %>%
  ggplot(aes(x=Sulfates)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Potassium Sulfate (g/L)") +
  ggtitle("Fig.10a Distribution of Measures of Potassium Sulfate")

#Bar graph showing the distribution of sulfates
wine %>%
  ggplot(aes(x=Sulfates)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Potassium Sulfate (g/L)", y = "Count of Wine Samples") +
  ggtitle("Fig.10b Distribution of Measures of Potassium Sulfate")

Figure 10a uncovers the presence of several outliers in the upper half of the data. The values of these outliers range from about 0.8 to a little more than 1.0 grams per liter. The majority of the data is between 0.4 and 0.6 grams per liter. Figure 10b reveals that the distribution is fairly skewed to the right because of those outliers.

Input Variable 11: Percent Alcohol

The percent alcohol variable records the percent of alcohol by volume in each of the wine samples. Wines with lower percentages are commonly described as “weaker,” while wines with greater percentages are described as “stronger.” As the percent alcohol increases, one tastes the alcohol more strongly over the natural flavors of the grape juice, which affects one’s enjoyment of the wine.

summary(wine$PercentAlc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The results of the summary function are expected based on common knowledge of wines sold in retail stores. The minimum percent alcohol is 8% and the maximum percent alcohol is 14.2%. Half of the wine samples contain between 10.4% and 11.4% alcohol.

#Box plot of the distribution of alcohol percentage
wine %>%
  ggplot(aes(x=PercentAlc)) +
  geom_boxplot(color = "black", fill = "lightblue") +
  theme(axis.text.y = element_blank()) +
  labs(x = "Measure of Alcohol Percentage") +
  ggtitle("Fig.11a Distribution of Measures of Alcohol Percentage")

#Bar graph showing the distribution of alcohol percentage
wine %>%
  ggplot(aes(x=PercentAlc)) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Measure of Alcohol Percentage", y = "Count of Wine Samples") +
  ggtitle("Fig.11b Distribution of Measures of Alcohol Percentage")

Figure 11a reveals that the data contains no outliers, and the great majority of wine samples contain between 9% and about 11.5% alcohol. Although there are no outliers, the distribution is still skewed to the right. Figure 11b shows how the number of wine samples gently decreases as the alcohol percentage increases.

Output Variable: Wine Quality

The wine quality variable contains the quality ratings of the wine samples on a scale of 0 to 10, where 0 is the worst possible score and 10 is the best. Each of these scores is the median of at least three separate scores, each awarded by a wine expert after tasting. These ratings should be viewed as an overall indication of how much someone would enjoy drinking the wine. It is also worth noting that this rating scale is not the same as the typical American grade scale, where anything less than a 6 or 7 would be perceived as bad. A rating of 5 should be viewed as the true separator between better and worse wines.

The wine quality ratings are grouped and the number of wine samples with each rating is displayed.

#How many wine samples received each rating.
WineQualitySummary <- wine %>% 
  group_by(WineQuality) %>%
  summarize(Number = n())
WineQualitySummary
## # A tibble: 7 × 2
##   WineQuality Number
##         <dbl>  <int>
## 1           3     20
## 2           4    163
## 3           5   1457
## 4           6   2198
## 5           7    880
## 6           8    175
## 7           9      5

This summary reveals that the data set contains no wines that would be classified as extremely poor quality (a rating below 3) and no “perfect” wines (a rating of 10). The number of wine samples receiving each rating also decreases moving outward from a rating of 6. This means the great majority of samples are of “ok” quality, receiving a 5 or 6 rating, while there are far fewer samples of “good” wines (rated 7 or more) and “poor” wines (rated 4 or less). This reveals a potential limitation of the data set: with so few wines that are considered very good or very bad, the machine learning algorithms may have difficulty classifying these wines. This summary also provides some rationale for how the outliers should be handled. Since there are very few 3s and 9s, it stands to reason that these ratings may correspond to the data points that were outliers in some of the variables. Since the models should still have the opportunity to predict these very good and very bad wines, the outlier values need to be left in the data set.

#Visualizing the data.
wine %>% 
  group_by(WineQuality) %>%
  summarize(Number = n()) %>%
  ggplot(aes(x = WineQuality, y = Number)) +
  geom_col(color = "black", fill = "lightblue") +
  scale_x_continuous("Wine Quality Rating", breaks = c(0:10), limits = c(0,10)) + 
  labs(y = "Count of Wine Samples") +
  ggtitle("Fig.12 Distribution of Wine Quality Ratings")

Figure 12 provides a visualization of the distribution of wine quality scores. Although the numbers of very good and very bad samples are small, the distribution is very close to normal.

Based on the wine rating process, it makes sense that the scores are distributed in this way. Since each of these scores is the median of at least three separate scores, they tend toward the middle. The only way a wine can score a 10 is if a majority of the experts who tasted it awarded it a 10, which one could expect to be a very rare event. The same is true of a wine that would score a 0, 1, or 2. Those wines would have to be so bad that the batch would more likely be pulled from the production process after internal testing than be bottled and sold.

Visualizing the Overall Data Set by Quality Rating

The following groups of plots show the distribution of each variable separated by the wine quality rating. The plots are stacked vertically and a red line is drawn through the median value. The median is selected as the representative measure of central tendency because the majority of variables had a skewed distribution and/or outliers.

The first group of plots displays the distribution of the fixed acidity variable by wine quality rating.

wine %>%
  ggplot(aes(x = FixedAcidity)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.13 Distribution of Fixed Acidity by Wine Quality") +
  labs(x = "Tartaric Acid in g/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(FixedAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(FixedAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(FixedAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(FixedAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(FixedAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(FixedAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(FixedAcidity)), color = "red")

These plots reveal that the median value remains fairly constant regardless of the wine quality rating. It is interesting that both the highest rated and lowest rated groups of wine samples have slightly higher fixed acidity than the other groups. Looking closely at the distributions, this may be more a product of a lack of data than of some underlying pattern.
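
The seven repeated geom_vline() calls in the chunk above can be collapsed into a single layer by computing the per-group medians first. The following is a minimal sketch of that alternative, assuming the same wine data frame with dplyr and ggplot2 loaded (the medians object name is illustrative). Because medians carries the WineQuality faceting variable, each red line is drawn only in its own panel.

```r
#Compute one median per quality rating, then draw all lines with one layer.
medians <- wine %>%
  group_by(WineQuality) %>%
  summarize(med = median(FixedAcidity))

wine %>%
  ggplot(aes(x = FixedAcidity)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = medians, aes(xintercept = med), color = "red") +
  labs(x = "Tartaric Acid in g/L", y = "Count of Wine Samples") +
  ggtitle("Fig.13 Distribution of Fixed Acidity by Wine Quality")
```

The same pattern applies to each of the faceted plots that follow by swapping in the relevant variable.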

The following group of plots displays the distribution of the volatile acidity variable by wine quality rating.

wine %>%
  ggplot(aes(x = VolatileAcidity)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.14 Distribution of Volatile Acidity by Wine Quality") +
  labs(x = "Acetic Acid in g/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(VolatileAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(VolatileAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(VolatileAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(VolatileAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(VolatileAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(VolatileAcidity)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(VolatileAcidity)), color = "red")

Again, the median volatile acidity value remains fairly constant across the wine quality ratings, with most of the medians at a value just below 0.3 grams per liter.

The following group of plots displays the distribution of the citric acid variable by wine quality rating.

wine %>%
  ggplot(aes(x = CitricAcid)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.15 Distribution of Citric Acid by Wine Quality") +
  labs(x = "Citric Acid in g/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(CitricAcid)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(CitricAcid)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(CitricAcid)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(CitricAcid)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(CitricAcid)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(CitricAcid)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(CitricAcid)), color = "red")

The median value of citric acid remains fairly constant regardless of wine quality rating. However, this group again exhibits the pattern where the median values of the best and worst wines are slightly greater than those of the middle quality wines.

The following group of plots displays the distribution of the residual sugar variable by wine quality rating.

wine %>%
  ggplot(aes(x = ResidSugar)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.16 Distribution of Residual Sugar by Wine Quality") +
  labs(x = "Residual Sugar in g/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(ResidSugar)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(ResidSugar)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(ResidSugar)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(ResidSugar)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(ResidSugar)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(ResidSugar)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(ResidSugar)), color = "red")

The median value of residual sugar shows some variation depending on the wine quality rating. The wines rated a 5 or a 6 have slightly greater medians than the other groups of wine samples.

The following group of plots displays the distribution of the chlorides variable by wine quality rating.

wine %>%
  ggplot(aes(x = Chlorides)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.17 Distribution of Chlorides by Wine Quality") +
  labs(x = "Sodium Chloride in g/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(Chlorides)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(Chlorides)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(Chlorides)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(Chlorides)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(Chlorides)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(Chlorides)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(Chlorides)), color = "red")

The median values of the chlorides variable are fairly constant as well, although the middle quality wines have a slightly greater median value than the other groups.

The following group of plots displays the distribution of the free SO2 variable by wine quality rating.

wine %>%
  ggplot(aes(x = FreeSO2)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.18 Distribution of Free SO2 by Wine Quality") +
  labs(x = "Free SO2 in mg/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(FreeSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(FreeSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(FreeSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(FreeSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(FreeSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(FreeSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(FreeSO2)), color = "red")

The medians of the free SO2 variable are fairly uniform except for the wine samples with a quality rating of 4. The median value of this group is a bit lower than the other groups.

The following group of plots displays the distribution of the total SO2 variable by wine quality rating.

wine %>%
  ggplot(aes(x = TotalSO2)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.19 Distribution of Total SO2 by Wine Quality") +
  labs(x = "Total SO2 in mg/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(TotalSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(TotalSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(TotalSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(TotalSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(TotalSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(TotalSO2)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(TotalSO2)), color = "red")

The median values of the total SO2 variable show the greatest variation so far. The median values for the wines rated a 4, 7, 8, or 9 are very similar. The median values for the wines rated a 3 or a 5 are also similar to each other. The median value for the wines rated a 6 falls between the medians of the two previously mentioned groups.

The following group of plots displays the distribution of the density variable by wine quality rating.

wine %>%
  ggplot(aes(x = Density)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.20 Distribution of Density by Wine Quality") +
  labs(x = "Density in g/ml", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(Density)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(Density)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(Density)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(Density)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(Density)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(Density)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(Density)), color = "red")

The median values of the density variable increase as the rating increases to 5, then decrease as the rating increases from 5 up to 9. It is interesting that the median value for the wines rated a 5 is the greatest, while the median values for the more highly rated wines are lower.

The following group of plots displays the distribution of the pH variable by wine quality rating.

wine %>%
  ggplot(aes(x = pH)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.21 Distribution of pH by Wine Quality") +
  labs(x = "pH", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(pH)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(pH)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(pH)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(pH)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(pH)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(pH)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(pH)), color = "red")

The median values for the pH take on somewhat of an inverted bell curve shape. The poorest and highest quality wines have the greatest pH values, while the middle quality wines have lower pH values.

The following group of plots displays the distribution of the sulfates variable by wine quality rating.

wine %>%
  ggplot(aes(x = Sulfates)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.22 Distribution of Sulfates by Wine Quality") +
  labs(x = "Potassium Sulfate in g/L", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(Sulfates)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(Sulfates)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(Sulfates)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(Sulfates)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(Sulfates)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(Sulfates)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(Sulfates)), color = "red")

The median values for the sulfates variable are fairly constant. The poorest quality wines and the highest quality wines have median sulfate levels that are a bit lower than the median values of the middle quality wines.

The last group of plots displays the distribution of the percent alcohol variable by wine quality rating.

wine %>%
  ggplot(aes(x = PercentAlc)) +
  geom_bar(aes(fill = WineQuality, color = WineQuality)) +
  ggtitle("Fig.23 Distribution of Percent Alcohol by Wine Quality") +
  labs(x = "Percent Alcohol", y = "Count of Wine Samples") +
  facet_wrap(~WineQuality, ncol = 1) +
  geom_vline(data = wine[wine$WineQuality == 3,], aes(xintercept = median(PercentAlc)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 4,], aes(xintercept = median(PercentAlc)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 5,], aes(xintercept = median(PercentAlc)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 6,], aes(xintercept = median(PercentAlc)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 7,], aes(xintercept = median(PercentAlc)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 8,], aes(xintercept = median(PercentAlc)), color = "red") +
  geom_vline(data = wine[wine$WineQuality == 9,], aes(xintercept = median(PercentAlc)), color = "red")

The median values for percent alcohol vary more than any of the other variables. The lower rated wines, 3 to 5, have lower median alcohol contents, while the higher rated wines, 7 to 9, have higher median alcohol contents. This may reveal that the experts have a preference for stronger wines over weaker ones.

Based on the analysis of the variation of medians in each variable separated by wine quality group, there is no discernible correlation between wine quality and any one individual physicochemical property. Therefore, any clustering or classifying will depend on the relationship between the values of multiple properties for each wine sample, rather than on any single one.

Analysis of Variable Correlations

Based on the definitions of each variable, a few stand out as candidates for being linearly correlated with one another. The following plot displays a heat map of the correlations between the variables in the data. The Pearson correlation coefficient, r, is calculated and then squared to get the r^2 value, which is rounded to two decimal places. These values are placed into a matrix and then visualized.

corr_matrix <- round(cor(wine)^2, 2)
melt_corr_matrix <- melt(corr_matrix)
ggplot(data = melt_corr_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(Var1, Var2, label = value), color = "white", size = 4) +
  ggtitle("Fig.24 Correlation Matrix Heat Map") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  labs(x = "Variable 1", y = "Variable 2")

From the heat map, it is obvious that most of the variables are not correlated with one another. However, density and residual sugar, as well as density and percent alcohol, are fairly strongly correlated, with respective r^2 values of 0.7 and 0.61. Also worth noting, total SO2 and free SO2 are somewhat weakly correlated with an r^2 value of 0.38, and density is weakly correlated with total SO2 with an r^2 value of 0.28.
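
For a larger variable set, the strongly correlated pairs can also be pulled out of corr_matrix programmatically instead of read off the heat map. The following is a minimal sketch; the 0.25 cutoff and the high object name are illustrative choices, not part of the original analysis.

```r
#List variable pairs whose r^2 meets an illustrative threshold of 0.25.
#upper.tri() keeps each pair once and drops the diagonal of 1s.
high <- which(corr_matrix >= 0.25 & upper.tri(corr_matrix), arr.ind = TRUE)
data.frame(Var1 = rownames(corr_matrix)[high[, "row"]],
           Var2 = colnames(corr_matrix)[high[, "col"]],
           r2   = corr_matrix[high])
```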

Cleaning and Preparing the Data Set for Analyses

Since the wine quality variable is the output variable that will be the focus of all the analyses, it will be converted from a numeric variable to an ordered factor variable. This conversion is necessary for all of the model types that will be used.

wine$WineQuality <- ordered(wine$WineQuality)

Next, separate versions of the data set will be prepared based on the requirements for each type of model that will be used in the analysis.

Cleaning and Preparing the Data for Association Rule Mining

First the data set will be duplicated under a different name so the original data set can be used as the starting point for the other types of data preparation as well.

#Copying the data set.
wineAR <- wine

Next, the density variable will be removed from the association rule mining data set. All of its values fall within a very narrow range, and it is strongly correlated with two other variables in the data set. Density is, in essence, a function of those other variables; for example, it depends on the proportion of grape juice to alcohol, which is already represented by the percent alcohol variable.

#Feature Selection. Removing the density variable.
wineAR <- wineAR %>%
  select(-Density)

To perform association rule mining, all of the variables need to be discretized. Discretization turns continuous variables into discrete ones by creating bins into which the data values are grouped. Since the wine quality variable takes seven different values, the other variables will be discretized into seven approximately equal-sized bins. By matching the number of bins to the number of quality ratings, it will be interesting to see whether certain bins consistently lead to certain ratings.

#Need to discretize all other variables. Going to make 7 bins to match 7 wine quality values.

#Discretize into approximately equal sized bins.
FixedAcidityBreaks <- classIntervals(wineAR$FixedAcidity, n = 7, style = "quantile")
FixedAcidityBreaks
## style: quantile
##   one of 99,795,696 possible partitions of this variable into 7 classes
##    [3.8,6)    [6,6.4)  [6.4,6.7)  [6.7,6.9)  [6.9,7.2)  [7.2,7.7) [7.7,14.2] 
##        574        721        796        544        675        854        734
wineAR$FixedAcidity <- cut(wineAR$FixedAcidity, breaks = c(3.8, 6, 6.4,
                                                           6.7, 6.9, 7.2,
                                                           7.7, 14.3),
                           labels = c("[3.8, 6)", "[6, 6.4)", "[6.4, 6.7)",
                                      "[6.7, 6.9)", "[6.9, 7.2)", "[7.2, 7.7)",
                                      "[7.7, 14.2]"),
                           right = FALSE)

VolatileAcidityBreaks <- classIntervals(wineAR$VolatileAcidity, n = 7, style = "quantile")
VolatileAcidityBreaks
## style: quantile
##   one of 4,465,475,476 possible partitions of this variable into 7 classes
## [0.08,0.18) [0.18,0.22) [0.22,0.25) [0.25,0.28) [0.28,0.31) [0.31,0.37) 
##         553         762         710         707         633         814 
##  [0.37,1.1] 
##         719
wineAR$VolatileAcidity <- cut(wineAR$VolatileAcidity, breaks = c(0.08, 0.18, 0.22,
                                                           0.25, 0.28, 0.31,
                                                           0.37, 1.2),
                           labels = c("[0.08, 0.18)", "[0.18, 0.22)", "[0.22, 0.25)",
                                      "[0.25, 0.28)", "[0.28, 0.31)", "[0.31, 0.37)",
                                      "[0.37, 1.1]"),
                           right = FALSE)

CitricBreaks <- classIntervals(wineAR$CitricAcid, n = 7, style = "quantile")
CitricBreaks
## style: quantile
##   one of 470,155,077 possible partitions of this variable into 7 classes
##    [0,0.24) [0.24,0.27)  [0.27,0.3)  [0.3,0.33) [0.33,0.37) [0.37,0.45) 
##         666         536         721         764         722         751 
## [0.45,1.66] 
##         738
wineAR$CitricAcid <- cut(wineAR$CitricAcid, breaks = c(0, 0.24, 0.27,
                                                      0.3, 0.33, 0.37,
                                                      0.45, 1.67),
                              labels = c("[0, 0.24)", "[0.24, 0.27)", "[0.27, 0.3)",
                                         "[0.3, 0.33)", "[0.33, 0.37)", "[0.37, 0.45)",
                                         "[0.45, 1.66]"),
                              right = FALSE)

SugarBreaks <- classIntervals(wineAR$ResidSugar, n = 7, style = "quantile")
SugarBreaks
## style: quantile
##   [0.6,1.4)   [1.4,1.9)   [1.9,4.1)   [4.1,6.6)   [6.6,8.9)  [8.9,12.8) 
##         659         701         737         701         691         707 
## [12.8,65.8] 
##         702
wineAR$ResidSugar <- cut(wineAR$ResidSugar, breaks = c(0.6, 1.4, 1.9,
                                                      4.1, 6.6, 8.9,
                                                      12.8, 65.9),
                              labels = c("[0.6, 1.4)", "[1.4, 1.9)", "[1.9, 4.1)",
                                         "[4.1, 6.6)", "[6.6, 8.9)", "[8.9, 12.8)",
                                         "[12.8, 65.8]"),
                              right = FALSE)

ChloridesBreaks <- classIntervals(wineAR$Chlorides, n = 7, style = "quantile")
ChloridesBreaks
## style: quantile
##   one of 20,398,507,129 possible partitions of this variable into 7 classes
## [0.009,0.032) [0.032,0.037) [0.037,0.041) [0.041,0.045) [0.045,0.049) 
##           671           726           666           673           696 
## [0.049,0.055) [0.055,0.346] 
##           751           715
wineAR$Chlorides <- cut(wineAR$Chlorides, breaks = c(0.009, 0.032, 0.037,
                                                       0.041, 0.045, 0.049,
                                                       0.055, 0.347),
                         labels = c("[0.009, 0.032)", "[0.032, 0.037)", "[0.037, 0.041)",
                                    "[0.041, 0.045)", "[0.045, 0.049)", "[0.049, 0.055)",
                                    "[0.055, 0.346]"),
                         right = FALSE)

FreeSO2Breaks <- classIntervals(wineAR$FreeSO2, n = 7, style = "quantile")
FreeSO2Breaks
## style: quantile
##   one of 6,249,655,776 possible partitions of this variable into 7 classes
##   [2,18)  [18,25)  [25,31)  [31,36)  [36,44)  [44,53) [53,289] 
##      670      690      712      612      792      687      735
wineAR$FreeSO2 <- cut(wineAR$FreeSO2, breaks = c(2, 18, 25,
                                                31, 36, 44,
                                                53, 290),
                         labels = c("[2, 18)", "[18, 25)", "[25, 31)",
                                    "[31, 36)", "[36, 44)", "[44, 53)",
                                    "[53, 289]"),
                         right = FALSE)

TotalSO2Breaks <- classIntervals(wineAR$TotalSO2, n = 7, style = "quantile")
TotalSO2Breaks
## style: quantile
##   [9,94.57143) [94.57143,112)      [112,126)      [126,143)      [143,162) 
##            700            688            666            719            716 
##      [162,185)      [185,440] 
##            694            715
wineAR$TotalSO2 <- cut(wineAR$TotalSO2, breaks = c(9, 94, 112,
                                                       126, 143, 162,
                                                       185, 441),
                         labels = c("[9, 94)", "[94, 112)", "[112, 126)",
                                    "[126, 143)", "[143, 162)", "[162, 185)",
                                    "[185, 440]"),
                         right = FALSE)

pHBreaks <- classIntervals(wineAR$pH, n = 7, style = "quantile")
pHBreaks
## style: quantile
##   one of 1,346,548,665 possible partitions of this variable into 7 classes
## [2.72,3.03)  [3.03,3.1)  [3.1,3.15)  [3.15,3.2)  [3.2,3.26) [3.26,3.35) 
##         628         686         684         707         740         739 
## [3.35,3.82] 
##         714
wineAR$pH <- cut(wineAR$pH, breaks = c(2.72, 3.03, 3.1,
                                       3.15, 3.2, 3.26,
                                       3.35, 3.83),
                         labels = c("[2.72, 3.03)", "[3.03, 3.1)", "[3.1, 3.15)",
                                    "[3.15, 3.2)", "[3.2, 3.26)", "[3.26, 3.35)",
                                    "[3.35, 3.82]"),
                         right = FALSE)

SulfatesBreaks <- classIntervals(wineAR$Sulfates, n = 7, style = "quantile")
SulfatesBreaks
## style: quantile
##   one of 256,851,595 possible partitions of this variable into 7 classes
##      [0.22,0.38)      [0.38,0.42)      [0.42,0.46) [0.46,0.4928571) 
##              649              672              736              742 
## [0.4928571,0.54)       [0.54,0.6)       [0.6,1.08] 
##              680              656              763
wineAR$Sulfates <- cut(wineAR$Sulfates, breaks = c(0.22, 0.38, 0.42,
                                                       0.46, 0.49, 0.54,
                                                       0.6, 1.09),
                         labels = c("[0.22, 0.38)", "[0.38, 0.42)", "[0.42, 0.46)",
                                    "[0.46, 0.49)", "[0.49, 0.54)", "[0.54, 0.6)",
                                    "[0.6, 1.08]"),
                         right = FALSE)

AlcBreaks <- classIntervals(wineAR$PercentAlc, n = 7, style = "quantile")
AlcBreaks
## style: quantile
##   one of 1,346,548,665 possible partitions of this variable into 7 classes
##     [8,9.2)   [9.2,9.5)  [9.5,10.1) [10.1,10.6) [10.6,11.2) [11.2,12.1) 
##         646         562         878         653         684         766 
## [12.1,14.2] 
##         709
wineAR$PercentAlc <- cut(wineAR$PercentAlc, breaks = c(8, 9.2, 9.5,
                                                       10.1, 10.6, 11.2,
                                                       12.1, 14.3),
                         labels = c("[8, 9.2)", "[9.2, 9.5)", "[9.5, 10.1)",
                                    "[10.1, 10.6)", "[10.6, 11.2)", "[11.2, 12.1)",
                                    "[12.1, 14.2]"),
                         right = FALSE)

All of the variables are now factors containing seven levels.

#Check the structure of the modified data frame.
str(wineAR)
## tibble [4,898 × 11] (S3: tbl_df/tbl/data.frame)
##  $ FixedAcidity   : Factor w/ 7 levels "[3.8, 6)","[6, 6.4)",..: 5 2 7 6 6 7 2 5 2 7 ...
##  $ VolatileAcidity: Factor w/ 7 levels "[0.08, 0.18)",..: 4 5 5 3 3 5 6 4 5 3 ...
##  $ CitricAcid     : Factor w/ 7 levels "[0, 0.24)","[0.24, 0.27)",..: 5 5 6 4 4 6 1 5 5 6 ...
##  $ ResidSugar     : Factor w/ 7 levels "[0.6, 1.4)","[1.4, 1.9)",..: 7 2 5 5 5 5 5 7 2 2 ...
##  $ Chlorides      : Factor w/ 7 levels "[0.009, 0.032)",..: 5 6 6 7 7 6 5 5 6 4 ...
##  $ FreeSO2        : Factor w/ 7 levels "[2, 18)","[18, 25)",..: 6 1 3 6 6 3 3 6 1 3 ...
##  $ TotalSO2       : Factor w/ 7 levels "[9, 94)","[94, 112)",..: 6 4 2 7 7 2 4 6 4 4 ...
##  $ pH             : Factor w/ 7 levels "[2.72, 3.03)",..: 1 6 6 4 4 6 4 1 6 5 ...
##  $ Sulfates       : Factor w/ 7 levels "[0.22, 0.38)",..: 3 5 3 2 2 3 4 3 5 3 ...
##  $ PercentAlc     : Factor w/ 7 levels "[8, 9.2)","[9.2, 9.5)",..: 1 3 4 3 3 4 3 1 3 5 ...
##  $ WineQuality    : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...

The following plots display how many of the wine samples fall into each bin. Each plot should show a roughly uniform distribution, although the number of observations will not be perfectly equal across bins.

#Visualization of discretized fixed acidity variable
wineAR %>%
  ggplot(aes(x = FixedAcidity)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.25 Distribution of Discretized Fixed Acidity Variable") +
  labs(x = "Tartaric Acid in g/L", y = "Count of Wine Samples")

#Visualization of discretized volatile acidity variable
wineAR %>%
  ggplot(aes(x = VolatileAcidity)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.26 Distribution of Discretized Volatile Acidity Variable") +
  labs(x = "Acetic Acid in g/L", y = "Count of Wine Samples")

#Visualization of discretized citric acid variable
wineAR %>%
  ggplot(aes(x = CitricAcid)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.27 Distribution of Discretized Citric Acid Variable") +
  labs(x = "Citric Acid in g/L", y = "Count of Wine Samples")

#Visualization of discretized residual sugar variable
wineAR %>%
  ggplot(aes(x = ResidSugar)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.28 Distribution of Discretized Residual Sugar Variable") +
  labs(x = "Residual Sugar in g/L", y = "Count of Wine Samples")

#Visualization of discretized chlorides variable
wineAR %>%
  ggplot(aes(x = Chlorides)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.29 Distribution of Discretized Chlorides Variable") +
  labs(x = "Chlorides in g/L", y = "Count of Wine Samples")

#Visualization of discretized free SO2 variable
wineAR %>%
  ggplot(aes(x = FreeSO2)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.30 Distribution of Discretized Free SO2 Variable") +
  labs(x = "Free SO2 in mg/L", y = "Count of Wine Samples")

#Visualization of discretized total SO2 variable
wineAR %>%
  ggplot(aes(x = TotalSO2)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.31 Distribution of Discretized Total SO2 Variable") +
  labs(x = "Total SO2 in mg/L", y = "Count of Wine Samples")

#Visualization of discretized pH variable
wineAR %>%
  ggplot(aes(x = pH)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.32 Distribution of Discretized pH Variable") +
  labs(x = "pH", y = "Count of Wine Samples")

#Visualization of discretized sulfates variable
wineAR %>%
  ggplot(aes(x = Sulfates)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.33 Distribution of Discretized Sulfates Variable") +
  labs(x = "Potassium Sulfate in g/L", y = "Count of Wine Samples")

#Visualization of discretized percent alcohol variable
wineAR %>%
  ggplot(aes(x = PercentAlc)) +
  geom_bar(color = "black", fill = "lightblue") +
  ggtitle("Fig.34 Distribution of Discretized Percent Alcohol Variable") +
  labs(x = "Percent Alcohol", y = "Count of Wine Samples")

These visualizations confirm that the bins are approximately equal in size, albeit with some expected variation. This data set is now ready to be used for the association rule mining analysis.
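The classIntervals()/cut() pattern above was repeated by hand for every variable. As a side note, it could be wrapped in a small helper using base R's quantile(); this is only a sketch, and the function name quantile_bin is hypothetical:

```r
# Hypothetical helper: bin a numeric vector into n quantile-based classes,
# mirroring the classIntervals(style = "quantile") / cut() pattern above.
quantile_bin <- function(x, n = 7) {
  brks <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1)))
  # right = FALSE gives [a, b) intervals; include.lowest = TRUE keeps the
  # maximum value inside the final bin instead of producing an NA.
  cut(x, breaks = brks, right = FALSE, include.lowest = TRUE, dig.lab = 4)
}

x <- rnorm(1000)
table(quantile_bin(x))  # roughly equal counts per bin
```

Manually copying the rounded breakpoints, as done above, has the advantage of producing readable labels, at the cost of more code per variable.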

Cleaning and Preparing the Data for Clustering Models

Clustering models, such as k-Means Clustering and Hierarchical Agglomerative Clustering, require the data set to be entirely numeric. Returning to the original wine data set, the 11 input variables are numeric and the output variable is a factor. Since clustering is an unsupervised machine learning technique, it does not require pre-defined classes to create its groupings. Therefore, the wine quality ratings will be removed from the data set.

First, another copy of the original data set is created before starting the data preparation process.

#Create a copy of the data set
wineCluster <- wine

Now, the wine quality column will be dropped from the data set. The wine quality column will be rejoined to the data set after each clustering model is created to visualize the results.

#Remove the wine quality column.
wineCluster <- wineCluster %>%
  select(-WineQuality)

Checking the structure of the data set confirms that the wine quality column has been removed.

#Check that wine quality column has been removed
str(wineCluster)
## tibble [4,898 × 11] (S3: tbl_df/tbl/data.frame)
##  $ FixedAcidity   : num [1:4898] 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ VolatileAcidity: num [1:4898] 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ CitricAcid     : num [1:4898] 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ ResidSugar     : num [1:4898] 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ Chlorides      : num [1:4898] 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ FreeSO2        : num [1:4898] 45 14 30 47 47 30 30 45 14 28 ...
##  $ TotalSO2       : num [1:4898] 170 132 97 186 186 97 136 170 132 129 ...
##  $ Density        : num [1:4898] 1.001 0.994 0.995 0.996 0.996 ...
##  $ pH             : num [1:4898] 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ Sulfates       : num [1:4898] 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ PercentAlc     : num [1:4898] 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...

Examining the first few rows of data shows that the wine quality ratings are no longer present.

#Print the first 6 rows of the data set
head(wineCluster)
## # A tibble: 6 × 11
##   FixedAcidity VolatileAcidity CitricAcid ResidSugar Chlorides FreeSO2 TotalSO2
##          <dbl>           <dbl>      <dbl>      <dbl>     <dbl>   <dbl>    <dbl>
## 1          7              0.27       0.36       20.7     0.045      45      170
## 2          6.3            0.3        0.34        1.6     0.049      14      132
## 3          8.1            0.28       0.4         6.9     0.05       30       97
## 4          7.2            0.23       0.32        8.5     0.058      47      186
## 5          7.2            0.23       0.32        8.5     0.058      47      186
## 6          8.1            0.28       0.4         6.9     0.05       30       97
## # … with 4 more variables: Density <dbl>, pH <dbl>, Sulfates <dbl>,
## #   PercentAlc <dbl>

Clustering models perform best when the data is normalized. Since many of the variables have skewed distributions and sit on very different scales, the clustering results could be misleading if the data is left as it is. The normalize() function will be used.

wineCluster <- normalize(wineCluster, method = "standardize")

This converts all variables to the same scale, with approximate means of 0 and standard deviations of 1, so that no single variable dominates the distance calculations that clustering models use.

sapply(wineCluster, mean)
##    FixedAcidity VolatileAcidity      CitricAcid      ResidSugar       Chlorides 
##    2.359013e-14   -1.044682e-14    5.475489e-14   -2.580215e-15   -1.192640e-15 
##         FreeSO2        TotalSO2         Density              pH        Sulfates 
##    1.242144e-17   -1.568547e-16    2.147620e-12    1.327622e-14   -1.298984e-14 
##      PercentAlc 
##   -2.900525e-14
sapply(wineCluster, sd)
##    FixedAcidity VolatileAcidity      CitricAcid      ResidSugar       Chlorides 
##               1               1               1               1               1 
##         FreeSO2        TotalSO2         Density              pH        Sulfates 
##               1               1               1               1               1 
##      PercentAlc 
##               1
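As an aside, the same standardization can be reproduced in base R with scale(), which also centers each column to mean 0 and rescales it to standard deviation 1. A minimal sketch on made-up toy data:

```r
# Toy data frame standing in for the numeric wine variables.
df <- data.frame(a = c(7.0, 6.3, 8.1, 7.2), b = c(170, 132, 97, 186))

# scale() subtracts each column's mean and divides by its standard
# deviation, matching normalize(..., method = "standardize").
dfScaled <- as.data.frame(scale(df))

sapply(dfScaled, mean)  # effectively 0, up to floating point error
sapply(dfScaled, sd)    # 1 for every column
```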

The data set is now cleaned and prepared for the clustering analysis.

Cleaning and Preparing the Data for Classification Models

Classification models are examples of supervised machine learning techniques. These models require a target class to predict based on the values of the input variables. In this case, the classes are the wine quality ratings. Classification models are very flexible in that they are able to take any type of variable as input. They only require that the class they are predicting is in the form of a factor variable.

The original wine data set will be copied into two new data sets for the classification models. Two copies are required because the distance-based techniques, the k-nearest neighbor and support vector machine models, require normalized data.

#Copy the wine data set for the classification models
wineClassify <- wine

#Copy and create data set for distance based classification models.
wineClassifyDist <- wine
wineClassifyDist <- wineClassifyDist %>%
  select(-WineQuality)
wineClassifyDist <- normalize(wineClassifyDist, method = "standardize")
wineClassifyDist$WineQuality <- wine$WineQuality

The wine quality variable was already changed to a factor variable in the wine data set so that change is preserved in the new data sets.

In order to assess the accuracy of the model’s predictions, each data set must be separated into two data sets, one used to train the model and one used to test the model. The data sets will be split so that 70% of the data is placed into the training data set and 30% of the data is placed into the testing data set. This ratio will provide plenty of data for the model to learn from and to be assessed on.

In order to create the training and testing data, the createDataPartition() function will be used. This function samples within each output category so that the class proportions are preserved in both the training and testing data. Using the same partition indices for both copies also guarantees that identical wine samples appear in each train and test set, with the only difference being that one data set contains the original values and the other contains the normalized values.
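The within-class sampling behind a stratified split can be illustrated with a self-contained base R sketch; the class labels and sizes here are made up and only mimic the imbalance in the wine quality ratings:

```r
set.seed(1)

# Toy outcome with unbalanced classes, mimicking the wine quality ratings.
y <- factor(rep(c("low", "mid", "high"), times = c(100, 600, 300)))

# Stratified 70/30 split: sample 70% of the row indices within each class.
trainIdx <- unlist(lapply(split(seq_along(y), y), function(idx)
  sample(idx, size = round(0.7 * length(idx)))))

# The class proportions are preserved exactly in both partitions.
prop.table(table(y[trainIdx]))   # high 0.3, low 0.1, mid 0.6
prop.table(table(y[-trainIdx]))  # high 0.3, low 0.1, mid 0.6
```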

#Set seed to control randomization when forming training and testing data.
set.seed(12345)

#Creating a 70/30 split between training and testing data partitioning based on the wine quality rating.
trainList <- createDataPartition(y = wineClassify$WineQuality, p = 0.70, list = FALSE)
wineTrain <- wineClassify[trainList,]
wineTest <- wineClassify[-trainList,]

wineDistTrain <- wineClassifyDist[trainList,]
wineDistTest <- wineClassifyDist[-trainList,]

The following summary displays the percentage of wine examples that are in each wine quality rating category. This will be compared to the percentages found in the training and testing data.

round((summary(wineClassify$WineQuality) / nrow(wineClassify))*100,2)
##     3     4     5     6     7     8     9 
##  0.41  3.33 29.75 44.88 17.97  3.57  0.10

The following summary displays the percentage of the training data that is made up of wine samples with each wine quality rating.

#Percentage of training data by wine quality rating
round((summary(wineTrain$WineQuality) / nrow(wineTrain))*100,2)
##     3     4     5     6     7     8     9 
##  0.41  3.35 29.73 44.86 17.95  3.58  0.12
round((summary(wineDistTrain$WineQuality) / nrow(wineDistTrain))*100,2)
##     3     4     5     6     7     8     9 
##  0.41  3.35 29.73 44.86 17.95  3.58  0.12

The percentages of the wine samples by wine quality ratings are almost identical to those in the original data set before it was split into training and testing sets.

The following summary displays the percentage of the testing data that is made up of wine samples with each wine quality rating.

#Percentage of testing data by wine quality rating
round((summary(wineTest$WineQuality) / nrow(wineTest))*100,2)
##     3     4     5     6     7     8     9 
##  0.41  3.27 29.79 44.92 18.00  3.54  0.07
round((summary(wineDistTest$WineQuality) / nrow(wineDistTest))*100,2)
##     3     4     5     6     7     8     9 
##  0.41  3.27 29.79 44.92 18.00  3.54  0.07

The percentages for the testing data are extremely similar to those of the training data and of the data set before it was split. This consistency in the proportions of wine samples by wine quality rating across the training and testing data provides an ideal environment for training and evaluating the classification models.

The data set is now prepared for the classification analysis.

Models

Several types of analyses and models will be created in an effort to see if the wine quality can be predicted by the physiochemical properties. First, an association rule mining analysis will be run to identify common associations and itemsets present in the data. Next, an unsupervised machine learning analysis featuring k-means clustering and hierarchical agglomerative clustering will test whether the models can recover the same groups that would be formed by grouping the data by wine quality rating. Lastly, a supervised machine learning analysis will feature decision trees, random forests, support vector machines, k-nearest neighbor models, and naive Bayes models. For each unsupervised and supervised machine learning technique, a final model will be selected based on accuracy, and that model will be highlighted in the results section and compared to the other final models.

Association Rule Mining Analysis

Association rule mining takes an algorithmic approach to frequent pattern analysis. It finds rules in a set of transactions that predict the occurrence of an item based on the occurrence of other items. In this case, the values of the physiochemical properties will be linked to the values of other properties those wine samples are likely to have.

Association rules are described in terms of itemsets. Each itemset is a collection of one or more items from the transaction data. In this case, each of the items is one of the bins that were created for each variable. Rules are primarily rated on three metrics: support, confidence, and lift.

Support describes the frequency of the itemset in the data. A support of 0.10 means the itemset occurs in 10% of all transactions. Confidence describes how often the right hand side of a rule appears in transactions that contain the left hand side. A confidence of 0.90 means there is a 90% probability that the right hand side appears given that the left hand side appears. Lift describes how interesting the generated rule is by comparing the rule's confidence to the baseline frequency of the right hand side. A lift of 1 means the appearance of one side does not affect the appearance of the other. A lift greater than 1 means the appearance of the left hand side increases the odds that the right hand side appears. A lift less than 1 means the two sides appear together less often than would be expected if they were independent, indicating a negative association.
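These three metrics can be computed by hand for a single rule {A} => {B}. A minimal sketch over ten made-up transactions, where 1 means the item is present:

```r
# Ten toy transactions; 1 indicates that the item appears.
A <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1)
B <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 1)
n <- length(A)

support    <- sum(A & B) / n             # {A, B} occur together in 50% of transactions
confidence <- sum(A & B) / sum(A)        # P(B | A) = 5/6
lift       <- confidence / (sum(B) / n)  # ~1.39: A makes B more likely than baseline
```

Because the lift here is above 1, A and B are positively associated; a lift below 1 would mean A makes B less likely than its baseline frequency.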

This analysis will involve three tasks. The first task is to uncover the top 10 rules ordered by support, by confidence, and by lift. The second task is to choose an interesting left hand side and see what top 10 rules are generated; the metric used to order them will be whichever one produced the most interesting results in the first task. The third task is to set the right hand side equal to different groupings of the wine quality ratings to see which characteristics lead to which quality of wine.

Before generating the association rules, the wineAR data set must be converted to a transaction data set.

#Start Association Rule Mining
wineTransaction <- transactions(wineAR, format = "wide")
str(wineTransaction)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:53878] 4 10 18 27 32 40 47 49 58 63 ...
##   .. .. ..@ p       : int [1:4899] 0 11 22 33 44 55 66 77 88 99 ...
##   .. .. ..@ Dim     : int [1:2] 77 4898
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  77 obs. of  3 variables:
##   .. ..$ labels   : chr [1:77] "FixedAcidity=[3.8, 6)" "FixedAcidity=[6, 6.4)" "FixedAcidity=[6.4, 6.7)" "FixedAcidity=[6.7, 6.9)" ...
##   .. ..$ variables: Factor w/ 11 levels "Chlorides","CitricAcid",..: 3 3 3 3 3 3 3 10 10 10 ...
##   .. ..$ levels   : Factor w/ 77 levels "[0, 0.24)","[0.009, 0.032)",..: 52 58 59 61 62 63 64 9 10 11 ...
##   ..@ itemsetInfo:'data.frame':  4898 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:4898] "1" "2" "3" "4" ...

With the wine data converted to transaction data, the most frequently occurring items in the transactions can be visualized.

itemFrequencyPlot(wineTransaction, topN = 20, type = "absolute")
title("Most Frequent Items in Wine Transaction Data")

Unsurprisingly, wine quality ratings of 5 or 6 occur most often. There is an abundance of wines with these two ratings in the data set. The other items feature a mix of values for the different physiochemical properties.

Before making the three top 10 lists, the support and confidence thresholds of the apriori algorithm must be tuned to generate a sufficient number of rules. Generating between 50 and 100 rules should be enough to produce different top 10 lists when the rules are ordered by the different metrics. Since there are so many wine samples with unique combinations of values for their physiochemical properties, the support threshold will start at 0.1 and the confidence threshold at 0.9. The confidence threshold will be kept as high as possible to focus on generating only meaningful rules.

wineRules <- apriori(wineTransaction, parameter = list(supp = 0.1, conf = 0.9))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 489 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[77 item(s), 4898 transaction(s)] done [0.00s].
## sorting and recoding items ... [73 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(wineRules)
## set of 0 rules

The thresholds were too strict, and no rules met them. The support threshold will be lowered to 0.05 for the next attempt at generating association rules.

wineRules <- apriori(wineTransaction, parameter = list(supp = 0.05, conf = 0.9))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 244 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[77 item(s), 4898 transaction(s)] done [0.00s].
## sorting and recoding items ... [73 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(wineRules)
## set of 0 rules

These thresholds were still too strict. The support value will be lowered again, this time to 0.01.

wineRules <- apriori(wineTransaction, parameter = list(supp = 0.01, conf = 0.9))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 48 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[77 item(s), 4898 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(wineRules)
## set of 1 rules
## 
## rule length distribution (lhs + rhs):sizes
## 3 
## 1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01429   Min.   :0.9333   Min.   :0.01531   Min.   :6.512  
##  1st Qu.:0.01429   1st Qu.:0.9333   1st Qu.:0.01531   1st Qu.:6.512  
##  Median :0.01429   Median :0.9333   Median :0.01531   Median :6.512  
##  Mean   :0.01429   Mean   :0.9333   Mean   :0.01531   Mean   :6.512  
##  3rd Qu.:0.01429   3rd Qu.:0.9333   3rd Qu.:0.01531   3rd Qu.:6.512  
##  Max.   :0.01429   Max.   :0.9333   Max.   :0.01531   Max.   :6.512  
##      count   
##  Min.   :70  
##  1st Qu.:70  
##  Median :70  
##  Mean   :70  
##  3rd Qu.:70  
##  Max.   :70  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898    0.01        0.9
##                                                                        call
##  apriori(data = wineTransaction, parameter = list(supp = 0.01, conf = 0.9))

With a support value of 0.01, only one rule was generated. This may indicate the threshold is close to the value where many more rules will begin to meet the criteria. This time, the support value will be lowered to 0.005.

wineRules <- apriori(wineTransaction, parameter = list(supp = 0.005, conf = 0.9))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 24 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[77 item(s), 4898 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [24 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(wineRules)
## set of 24 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5 
##  2 17  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.125   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.005104   Min.   :0.9000   Min.   :0.005308   Min.   :2.149  
##  1st Qu.:0.005461   1st Qu.:0.9279   1st Qu.:0.005717   1st Qu.:6.426  
##  Median :0.005819   Median :0.9499   Median :0.006125   Median :6.793  
##  Mean   :0.006389   Mean   :0.9518   Mean   :0.006720   Mean   :6.209  
##  3rd Qu.:0.006431   3rd Qu.:0.9739   3rd Qu.:0.006788   3rd Qu.:6.977  
##  Max.   :0.014292   Max.   :1.0000   Max.   :0.015312   Max.   :7.983  
##      count      
##  Min.   :25.00  
##  1st Qu.:26.75  
##  Median :28.50  
##  Mean   :31.29  
##  3rd Qu.:31.50  
##  Max.   :70.00  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898   0.005        0.9
##                                                                         call
##  apriori(data = wineTransaction, parameter = list(supp = 0.005, conf = 0.9))

This time, a set of 24 rules was generated. To increase the likelihood of seeing distinct lists of rules when they are ordered by the different metrics, a longer list would be better. The support value will be lowered one more time, to 0.004.

wineRules <- apriori(wineTransaction, parameter = list(supp = 0.004, conf = 0.9))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 19 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[77 item(s), 4898 transaction(s)] done [0.00s].
## sorting and recoding items ... [76 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [108 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(wineRules)
## set of 108 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
##  2 39 57 10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   5.000   4.694   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.004083   Min.   :0.9000   Min.   :0.004083   Min.   :2.043  
##  1st Qu.:0.004287   1st Qu.:0.9231   1st Qu.:0.004287   1st Qu.:6.091  
##  Median :0.004287   Median :0.9661   Median :0.004696   Median :6.695  
##  Mean   :0.004809   Mean   :0.9636   Mean   :0.005004   Mean   :6.303  
##  3rd Qu.:0.004900   3rd Qu.:1.0000   3rd Qu.:0.005104   3rd Qu.:6.977  
##  Max.   :0.014292   Max.   :1.0000   Max.   :0.015312   Max.   :7.983  
##      count      
##  Min.   :20.00  
##  1st Qu.:21.00  
##  Median :21.00  
##  Mean   :23.56  
##  3rd Qu.:24.00  
##  Max.   :70.00  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898   0.004        0.9
##                                                                         call
##  apriori(data = wineTransaction, parameter = list(supp = 0.004, conf = 0.9))

These parameters have generated a list of 108 rules. These rules will be ordered by support, confidence, and lift to see which are the top 10 according to each metric.

First, the top 10 rules by support will be displayed and a few of the interesting rules will be discussed.

#Ordering the rules by support.
wineRulesSupport <- sort(wineRules, decreasing = TRUE, by = "support")
#Displaying the top 10 rules.
arules::inspect(wineRulesSupport[1:10])
##      lhs                                rhs                           support confidence    coverage     lift count
## [1]  {PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.014291548  0.9333333 0.015312372 6.512061    70
## [2]  {FixedAcidity=[7.2, 7.7),                                                                                     
##       PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.008574929  0.9767442 0.008779094 6.814947    42
## [3]  {ResidSugar=[12.8, 65.8],                                                                                     
##       TotalSO2=[162, 185),                                                                                         
##       pH=[2.72, 3.03)}               => {PercentAlc=[8, 9.2)}     0.007554104  0.9250000 0.008166599 7.013390    37
## [4]  {pH=[2.72, 3.03),                                                                                             
##       PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.007349939  1.0000000 0.007349939 6.977208    36
## [5]  {ResidSugar=[12.8, 65.8],                                                                                     
##       pH=[2.72, 3.03),                                                                                             
##       WineQuality=7}                 => {PercentAlc=[8, 9.2)}     0.007349939  0.9729730 0.007554104 7.377123    36
## [6]  {FixedAcidity=[7.2, 7.7),                                                                                     
##       Chlorides=[0.055, 0.346],                                                                                    
##       TotalSO2=[185, 440],                                                                                         
##       PercentAlc=[8, 9.2)}           => {ResidSugar=[12.8, 65.8]} 0.006737444  0.9705882 0.006941609 6.771996    33
## [7]  {VolatileAcidity=[0.37, 1.1],                                                                                 
##       TotalSO2=[94, 112),                                                                                          
##       WineQuality=7}                 => {PercentAlc=[12.1, 14.2]} 0.006329114  0.9687500 0.006533279 6.692437    31
## [8]  {VolatileAcidity=[0.31, 0.37),                                                                                
##       CitricAcid=[0.45, 1.66],                                                                                     
##       PercentAlc=[9.2, 9.5)}         => {WineQuality=5}           0.006124949  0.9090909 0.006737444 3.056093    30
## [9]  {CitricAcid=[0, 0.24),                                                                                        
##       pH=[3.35, 3.82],                                                                                             
##       PercentAlc=[12.1, 14.2]}       => {FixedAcidity=[3.8, 6)}   0.005920784  0.9354839 0.006329114 7.982578    29
## [10] {VolatileAcidity=[0.22, 0.25),                                                                                
##       CitricAcid=[0.37, 0.45),                                                                                     
##       PercentAlc=[8, 9.2)}           => {ResidSugar=[12.8, 65.8]} 0.005920784  1.0000000 0.005920784 6.977208    29

The top rule by support reveals that a wine with a percent alcohol between 8% and 9.2% and a wine quality rating of 7 will most likely have a residual sugar value between 12.8 and 65.8. This rule has a confidence of 93%, meaning that when the left hand side occurs, the right hand side occurs with it almost all of the time. If a wine has a low percentage of alcohol and is rated a 7, then it is almost always a sweeter wine containing a higher than average amount of residual sugar for this data set. This could imply that the wine experts have a preference for sweeter tasting wines.

Another interesting rule is #6, which reveals that 97% of the time when a wine sample has a fixed acidity value between 7.2 and 7.7, a chlorides value between 0.055 and 0.346, a total SO2 value between 185 and 440, and an alcohol percentage between 8% and 9.2%, it will have a residual sugar value between 12.8 and 65.8. This rule provides some insight into the combinations of physicochemical properties: wines with those properties are almost always sweeter tasting white wines.

A third interesting rule is #9, which reveals that 94% of the time when a wine sample has a citric acid value between 0 and 0.24, a pH value between 3.35 and 3.82, and an alcohol percentage between 12.1% and 14.2%, the fixed acidity value will be between 3.8 and 6. This rule also gives some insight into the relationships between the properties of the wine: wines that have a higher pH level, a low level of citric acid, and a high alcohol percentage tend to have a lower fixed acidity value.
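As a sanity check on how these quality measures fit together, the columns arules prints can be reproduced from raw transaction counts. The sketch below (Python rather than R, and not part of the original analysis) uses rule #9 from the support-sorted list; the LHS and RHS counts are back-calculated from the printed coverage and lift values, so they are inferred rather than taken directly from the data.

```python
# Sanity check (Python sketch, not part of the R analysis) that the arules
# quality measures for rule #9 follow from raw transaction counts.
# lhs_count and rhs_count are inferred from the printed coverage and lift.

n = 4898          # wine samples (transactions) in the data set
lhs_count = 31    # samples matching the full left hand side
rhs_count = 574   # samples with FixedAcidity in [3.8, 6) (inferred)
both = 29         # samples matching both sides (the printed count)

support = both / n                    # fraction with LHS and RHS together
coverage = lhs_count / n              # fraction with the LHS at all
confidence = both / lhs_count         # P(RHS | LHS)
lift = confidence / (rhs_count / n)   # confidence vs. the RHS base rate

assert abs(support - 0.005920784) < 1e-8
assert abs(coverage - 0.006329114) < 1e-8
assert abs(confidence - 0.9354839) < 1e-7
assert abs(lift - 7.982578) < 1e-5
```

The lift near 8 says that knowing the LHS makes this fixed acidity bin roughly eight times more likely than its base rate in the data set.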

Next, the top 10 rules by confidence will be displayed. Again, some interesting rules will be highlighted.

#Ordering the rules by confidence.
wineRulesConf <- sort(wineRules, decreasing = TRUE, by = "confidence")
#Displaying the top 10 rules.
arules::inspect(wineRulesConf[1:10])
##      lhs                                rhs                           support confidence    coverage     lift count
## [1]  {pH=[2.72, 3.03),                                                                                             
##       PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.007349939          1 0.007349939 6.977208    36
## [2]  {TotalSO2=[162, 185),                                                                                         
##       PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.004899959          1 0.004899959 6.977208    24
## [3]  {VolatileAcidity=[0.22, 0.25),                                                                                
##       CitricAcid=[0.37, 0.45),                                                                                     
##       PercentAlc=[8, 9.2)}           => {ResidSugar=[12.8, 65.8]} 0.005920784          1 0.005920784 6.977208    29
## [4]  {TotalSO2=[185, 440],                                                                                         
##       PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.005308289          1 0.005308289 6.977208    26
## [5]  {CitricAcid=[0.3, 0.33),                                                                                      
##       PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.005308289          1 0.005308289 6.977208    26
## [6]  {VolatileAcidity=[0.22, 0.25),                                                                                
##       CitricAcid=[0, 0.24),                                                                                        
##       FreeSO2=[36, 44)}              => {WineQuality=6}           0.004287464          1 0.004287464 2.228389    21
## [7]  {FixedAcidity=[6, 6.4),                                                                                       
##       VolatileAcidity=[0.18, 0.22),                                                                                
##       Sulfates=[0.49, 0.54)}         => {WineQuality=6}           0.004287464          1 0.004287464 2.228389    21
## [8]  {VolatileAcidity=[0.08, 0.18),                                                                                
##       ResidSugar=[12.8, 65.8],                                                                                     
##       pH=[2.72, 3.03),                                                                                             
##       PercentAlc=[8, 9.2)}           => {WineQuality=7}           0.004695794          1 0.004695794 5.565909    23
## [9]  {VolatileAcidity=[0.08, 0.18),                                                                                
##       pH=[2.72, 3.03),                                                                                             
##       PercentAlc=[8, 9.2),                                                                                         
##       WineQuality=7}                 => {ResidSugar=[12.8, 65.8]} 0.004695794          1 0.004695794 6.977208    23
## [10] {VolatileAcidity=[0.08, 0.18),                                                                                
##       ResidSugar=[12.8, 65.8],                                                                                     
##       pH=[2.72, 3.03),                                                                                             
##       WineQuality=7}                 => {PercentAlc=[8, 9.2)}     0.004695794          1 0.004695794 7.582043    23

The highest confidence rule is similar to the highest support rule. A pH value between 2.72 and 3.03, an alcohol percentage between 8% and 9.2%, and a wine quality rating of 7 always mean that the wine will have a high residual sugar value, between 12.8 and 65.8.

Rule #6 may give some insight into how the expert judges make their decisions on wine quality. When a wine sample has a volatile acidity value between 0.22 and 0.25, a citric acid value between 0 and 0.24, and a free SO2 value between 36 and 44, the wine always has a quality rating of 6 in this data set. Something about this combination of properties appears to result in an average-tasting wine. A wine with a rating of 6 is likely pleasant to drink, but does nothing to stand out from the crowd.

Similarly, rule #8 may reveal how to score a bit higher with the judges. A wine sample with a volatile acidity value between 0.08 and 0.18, residual sugar value between 12.8 and 65.8, pH value between 2.72 and 3.03, and alcohol percentage between 8% and 9.2% will always have a quality rating of 7. This rule reinforces the notion that the judges have a preference for sweeter wines with a lower pH value.

Lastly, the top 10 rules by lift will be displayed and then a few interesting rules will be discussed.

#Ordering the rules by lift.
wineRulesLift <- sort(wineRules, decreasing = TRUE, by = "lift")
#Displaying the top 10 rules.
arules::inspect(wineRulesLift[1:10])
##      lhs                                rhs                         support confidence    coverage     lift count
## [1]  {CitricAcid=[0, 0.24),                                                                                      
##       pH=[3.35, 3.82],                                                                                           
##       PercentAlc=[12.1, 14.2]}       => {FixedAcidity=[3.8, 6)} 0.005920784  0.9354839 0.006329114 7.982578    29
## [2]  {VolatileAcidity=[0.08, 0.18),                                                                              
##       ResidSugar=[12.8, 65.8],                                                                                   
##       PercentAlc=[8, 9.2),                                                                                       
##       WineQuality=7}                 => {pH=[2.72, 3.03)}       0.004695794  1.0000000 0.004695794 7.799363    23
## [3]  {VolatileAcidity=[0.08, 0.18),                                                                              
##       ResidSugar=[12.8, 65.8],                                                                                   
##       pH=[2.72, 3.03),                                                                                           
##       WineQuality=7}                 => {PercentAlc=[8, 9.2)}   0.004695794  1.0000000 0.004695794 7.582043    23
## [4]  {ResidSugar=[12.8, 65.8],                                                                                   
##       Chlorides=[0.055, 0.346],                                                                                  
##       FreeSO2=[44, 53),                                                                                          
##       pH=[2.72, 3.03)}               => {PercentAlc=[8, 9.2)}   0.004287464  1.0000000 0.004287464 7.582043    21
## [5]  {FixedAcidity=[7.2, 7.7),                                                                                   
##       ResidSugar=[12.8, 65.8],                                                                                   
##       FreeSO2=[44, 53),                                                                                          
##       pH=[2.72, 3.03)}               => {PercentAlc=[8, 9.2)}   0.004695794  1.0000000 0.004695794 7.582043    23
## [6]  {ResidSugar=[12.8, 65.8],                                                                                   
##       Chlorides=[0.055, 0.346],                                                                                  
##       TotalSO2=[162, 185),                                                                                       
##       pH=[2.72, 3.03)}               => {PercentAlc=[8, 9.2)}   0.004287464  1.0000000 0.004287464 7.582043    21
## [7]  {CitricAcid=[0.37, 0.45),                                                                                   
##       ResidSugar=[12.8, 65.8],                                                                                   
##       Chlorides=[0.055, 0.346],                                                                                  
##       TotalSO2=[185, 440]}           => {PercentAlc=[8, 9.2)}   0.004287464  1.0000000 0.004287464 7.582043    21
## [8]  {FixedAcidity=[7.2, 7.7),                                                                                   
##       CitricAcid=[0.37, 0.45),                                                                                   
##       ResidSugar=[12.8, 65.8],                                                                                   
##       Chlorides=[0.055, 0.346]}      => {PercentAlc=[8, 9.2)}   0.004083299  1.0000000 0.004083299 7.582043    20
## [9]  {FixedAcidity=[7.2, 7.7),                                                                                   
##       ResidSugar=[12.8, 65.8],                                                                                   
##       Chlorides=[0.049, 0.055),                                                                                  
##       WineQuality=7}                 => {PercentAlc=[8, 9.2)}   0.004287464  1.0000000 0.004287464 7.582043    21
## [10] {FixedAcidity=[7.2, 7.7),                                                                                   
##       ResidSugar=[12.8, 65.8],                                                                                   
##       Chlorides=[0.055, 0.346],                                                                                  
##       FreeSO2=[44, 53),                                                                                          
##       pH=[2.72, 3.03)}               => {PercentAlc=[8, 9.2)}   0.004287464  1.0000000 0.004287464 7.582043    21

The rule with the highest lift gives more information about the relationship between the properties. A citric acid value between 0 and 0.24, a pH value between 3.35 and 3.82, and an alcohol percentage between 12.1% and 14.2% results in a fixed acidity value between 3.8 and 6 with 94% confidence. It makes sense that the amount of citric acid and the pH level would be related to the fixed acidity level, since both of those properties also describe the acidity of the wine.

Eight of the ten rules have an alcohol percentage between 8% and 9.2% as the right hand side. Looking across these rules reveals that a fixed acidity value between 7.2 and 7.7, a residual sugar value between 12.8 and 65.8, and a pH value between 2.72 and 3.03 appear repeatedly, which implies that these properties likely have some relationship to the final alcohol percentage of the wine.
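One detail worth noting: every one of those eight rules has 100% confidence, and for a perfect-confidence rule the lift formula collapses to 1 / support(RHS). That is why all eight report the identical lift of 7.582043. A quick sketch (Python, not from the report; the RHS count is back-calculated from the printed lift):

```python
# Sketch (not from the report): when confidence = 1, lift = confidence /
# support(RHS) reduces to 1 / support(RHS), so all perfect-confidence rules
# sharing a right hand side share one lift value.

n = 4898
rhs_count = 646               # inferred samples with PercentAlc in [8, 9.2)
lift = 1.0 / (rhs_count / n)  # equivalently n / rhs_count

assert abs(lift - 7.582043) < 1e-5
```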

The second task involves fixing the left hand side of the rules to specific items to see what right hand sides are generated. This rule set will be ordered by confidence, since that ordering gave a greater variety of rules in the first task.

A high residual sugar content was featured in many of the rules previously seen, so it will be interesting to see what rules involve a low sugar content. Because these rules target a specific left hand side, the support and confidence thresholds will be set to 0 to guarantee that the rules exist. The rules will be ordered by confidence, so the analysis will focus on the “best” rules anyway.

#Identifying Top 10 Rules with low residual sugar left hand side ordered by confidence
wineRulesSugar <- apriori(data = wineTransaction, parameter = list(supp=0, conf = 0), appearance = list(default = "rhs", lhs = "ResidSugar=[0.6, 1.4)"), control = list(verbose = F))
summary(wineRulesSugar)
## set of 152 rules
## 
## rule length distribution (lhs + rhs):sizes
##  1  2 
## 76 76 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.5     1.5     2.0     2.0 
## 
## summary of quality measures:
##     support          confidence        coverage           lift       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.1345   Min.   :0.0000  
##  1st Qu.:0.01817   1st Qu.:0.1222   1st Qu.:0.1345   1st Qu.:0.9320  
##  Median :0.03450   Median :0.1419   Median :1.0000   Median :1.0000  
##  Mean   :0.08033   Mean   :0.1373   Mean   :0.5882   Mean   :0.9596  
##  3rd Qu.:0.14434   3rd Qu.:0.1556   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :0.44875   Max.   :0.4488   Max.   :1.0000   Max.   :2.2784  
##                                     NA's   :7                        
##      count       
##  Min.   :   0.0  
##  1st Qu.:  89.0  
##  Median : 169.0  
##  Mean   : 393.5  
##  3rd Qu.: 707.0  
##  Max.   :2198.0  
##                  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898       0          0
##                                                                                                                                                                   call
##  apriori(data = wineTransaction, parameter = list(supp = 0, conf = 0), appearance = list(default = "rhs", lhs = "ResidSugar=[0.6, 1.4)"), control = list(verbose = F))
wineRulesSugar <- sort(wineRulesSugar, decreasing = TRUE, by = "confidence")
arules::inspect(wineRulesSugar[1:10])
##      lhs                        rhs                               support confidence  coverage      lift count
## [1]  {}                      => {WineQuality=6}                0.44875459  0.4487546 1.0000000 1.0000000  2198
## [2]  {ResidSugar=[0.6, 1.4)} => {WineQuality=6}                0.05961617  0.4430956 0.1345447 0.9873896   292
## [3]  {ResidSugar=[0.6, 1.4)} => {TotalSO2=[9, 94)}             0.04205798  0.3125948 0.1345447 2.2784070   206
## [4]  {ResidSugar=[0.6, 1.4)} => {WineQuality=5}                0.04205798  0.3125948 0.1345447 1.0508507   206
## [5]  {}                      => {WineQuality=5}                0.29746835  0.2974684 1.0000000 1.0000000  1457
## [6]  {ResidSugar=[0.6, 1.4)} => {FreeSO2=[2, 18)}              0.03246223  0.2412747 0.1345447 1.7638258   159
## [7]  {ResidSugar=[0.6, 1.4)} => {PercentAlc=[11.2, 12.1)}      0.03144140  0.2336874 0.1345447 1.4942571   154
## [8]  {ResidSugar=[0.6, 1.4)} => {VolatileAcidity=[0.18, 0.22)} 0.02837893  0.2109256 0.1345447 1.3557924   139
## [9]  {ResidSugar=[0.6, 1.4)} => {TotalSO2=[94, 112)}           0.02756227  0.2048558 0.1345447 1.4013742   135
## [10] {ResidSugar=[0.6, 1.4)} => {Chlorides=[0.032, 0.037)}     0.02756227  0.2048558 0.1345447 1.3820715   135

The two rules with empty left hand sides will be ignored. None of the rules with a left hand side containing a residual sugar value between 0.6 and 1.4 are particularly strong; the support values are very low, and the highest confidence is only 44.3%. Regardless, these rules reveal that a low residual sugar level sometimes leads to an average wine quality score of 5 or 6. The remaining rules relate that range of residual sugar to particular values of the other properties.
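Lift makes the weakness of these rules concrete: values near 1 mean the left hand side barely changes the odds of the right hand side. The sketch below (Python, not from the report; the low-sugar count is derived from the printed coverage of 0.1345447) reworks rule [2] against the empty-LHS baseline in rule [1].

```python
# Sketch (not from the report) of why rule [2] is weak: its confidence is
# nearly identical to the baseline frequency of WineQuality=6 (rule [1]),
# so its lift sits just under 1, i.e. close to statistical independence.

n = 4898
quality6 = 2198   # samples rated 6 (count from the empty-LHS rule)
low_sugar = 659   # samples with ResidSugar in [0.6, 1.4) (from coverage)
both = 292        # low-sugar samples rated 6

baseline = quality6 / n        # chance of a 6 overall
confidence = both / low_sugar  # chance of a 6 given low sugar
lift = confidence / baseline   # just under 1 -> low sugar barely matters

assert abs(baseline - 0.4487546) < 1e-7
assert abs(confidence - 0.4430956) < 1e-7
assert abs(lift - 0.9873896) < 1e-5
```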

The left hand side will now focus on values of alcohol percentage that are closer to the middle of the distribution. The previous rules revealed associations for the lowest and highest alcohol percentages, so it will be interesting to see how the “middle” percentages are associated with other items in the data. The middle bin of [10.1, 10.6) will be selected. The rules will again be sorted by confidence.

#Identifying Top 10 Rules with medium alcohol percentage left hand side ordered by confidence
wineRulesAlc <- apriori(data = wineTransaction, parameter = list(supp=0, conf = 0), appearance = list(default = "rhs", lhs = "PercentAlc=[10.1, 10.6)"), control = list(verbose = F))
summary(wineRulesAlc)
## set of 152 rules
## 
## rule length distribution (lhs + rhs):sizes
##  1  2 
## 76 76 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.5     1.5     2.0     2.0 
## 
## summary of quality measures:
##     support          confidence        coverage           lift       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.1333   Min.   :0.0000  
##  1st Qu.:0.01735   1st Qu.:0.1247   1st Qu.:0.1333   1st Qu.:0.9489  
##  Median :0.03267   Median :0.1409   Median :1.0000   Median :1.0000  
##  Mean   :0.08026   Mean   :0.1373   Mean   :0.5845   Mean   :0.9665  
##  3rd Qu.:0.14434   3rd Qu.:0.1537   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :0.44875   Max.   :0.5268   Max.   :1.0000   Max.   :1.6493  
##                                     NA's   :6                        
##      count       
##  Min.   :   0.0  
##  1st Qu.:  85.0  
##  Median : 160.0  
##  Mean   : 393.1  
##  3rd Qu.: 707.0  
##  Max.   :2198.0  
##                  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898       0          0
##                                                                                                                                                                     call
##  apriori(data = wineTransaction, parameter = list(supp = 0, conf = 0), appearance = list(default = "rhs", lhs = "PercentAlc=[10.1, 10.6)"), control = list(verbose = F))
wineRulesAlc <- sort(wineRulesAlc, decreasing = TRUE, by = "confidence")
arules::inspect(wineRulesAlc[1:10])
##      lhs                          rhs                               support confidence  coverage      lift count
## [1]  {PercentAlc=[10.1, 10.6)} => {WineQuality=6}                0.07023275  0.5267994 0.1333197 1.1739142   344
## [2]  {}                        => {WineQuality=6}                0.44875459  0.4487546 1.0000000 1.0000000  2198
## [3]  {}                        => {WineQuality=5}                0.29746835  0.2974684 1.0000000 1.0000000  1457
## [4]  {PercentAlc=[10.1, 10.6)} => {WineQuality=5}                0.03613720  0.2710567 0.1333197 0.9112118   177
## [5]  {PercentAlc=[10.1, 10.6)} => {pH=[3.35, 3.82]}              0.03205390  0.2404288 0.1333197 1.6493280   157
## [6]  {PercentAlc=[10.1, 10.6)} => {ResidSugar=[1.4, 1.9)}        0.02878726  0.2159265 0.1333197 1.5087132   141
## [7]  {PercentAlc=[10.1, 10.6)} => {VolatileAcidity=[0.18, 0.22)} 0.02817477  0.2113323 0.1333197 1.3584064   138
## [8]  {PercentAlc=[10.1, 10.6)} => {Sulfates=[0.6, 1.08]}         0.02633728  0.1975498 0.1333197 1.2681504   129
## [9]  {PercentAlc=[10.1, 10.6)} => {Chlorides=[0.045, 0.049)}     0.02592895  0.1944870 0.1333197 1.3686742   127
## [10] {PercentAlc=[10.1, 10.6)} => {Sulfates=[0.54, 0.6)}         0.02572479  0.1929556 0.1333197 1.4406959   126

Again, the rules with blank left hand sides will be ignored. This time the top rule has slightly higher confidence at 52.7%. This rule forms an association between a middle alcohol percentage and a rating of 6. The other rules reveal some of the relationships between a middle alcohol percentage and the other properties, although all of these confidence levels are very low. Sometimes a middle alcohol percentage can result in a low level of residual sugar or a high level of sulfates.

The third task is to mine association rules where the right hand side is one of the quality ratings. For this analysis, three groups of rules will be created: one where the right hand side is restricted to “good” wines (those rated a 7, 8, or 9), one for “average” wines (those rated a 5 or a 6), and one for “poor” wines (those rated a 3 or a 4). The intention is to analyze the items that the left hand sides of the rules have in common, revealing possible patterns in the property values that lead to wines being rated differently. Each rule set will be ordered by confidence and the top five rules displayed.

For each of the rule sets, the support and confidence thresholds will be raised or lowered based on how many wine samples in the data set fall into each of the “poor”, “average”, or “good” quality groups.
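The rating buckets described above can be sketched as a simple mapping (illustrative Python, not part of the R analysis; the function name is hypothetical):

```python
# Hypothetical helper (illustration only) for the quality buckets used below.
def quality_group(rating: int) -> str:
    """Map a 3-9 wine quality rating to its poor/average/good bucket."""
    if rating >= 7:
        return "good"
    if rating >= 5:
        return "average"
    return "poor"

assert [quality_group(r) for r in range(3, 10)] == \
    ["poor", "poor", "average", "average", "good", "good", "good"]
```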

The first rule set generated will be for the “good” wines, which are those rated a 7 or higher.

#Identifying Top 5 Rules to Identify Highly Rated Wine by Confidence
wineRulesGood <- apriori(data = wineTransaction, parameter = list(supp=0.0018, conf = 0.9),
                        appearance = list(default="lhs", rhs=c("WineQuality=7", "WineQuality=8", "WineQuality=9")),
                        control = list(verbose = F))
summary(wineRulesGood)
## set of 318 rules
## 
## rule length distribution (lhs + rhs):sizes
##   4   5   6   7   8 
##   5 149 117  41   6 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   5.000   6.000   5.667   6.000   8.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001837   Min.   :0.9000   Min.   :0.001837   Min.   : 5.009  
##  1st Qu.:0.001837   1st Qu.:1.0000   1st Qu.:0.001837   1st Qu.: 5.566  
##  Median :0.001837   Median :1.0000   Median :0.002042   Median : 5.566  
##  Mean   :0.002190   Mean   :0.9814   Mean   :0.002232   Mean   : 7.331  
##  3rd Qu.:0.002858   3rd Qu.:1.0000   3rd Qu.:0.002858   3rd Qu.: 5.566  
##  Max.   :0.004696   Max.   :1.0000   Max.   :0.004696   Max.   :27.989  
##      count      
##  Min.   : 9.00  
##  1st Qu.: 9.00  
##  Median : 9.00  
##  Mean   :10.73  
##  3rd Qu.:14.00  
##  Max.   :23.00  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898  0.0018        0.9
##                                                                                                                                                                                                       call
##  apriori(data = wineTransaction, parameter = list(supp = 0.0018, conf = 0.9), appearance = list(default = "lhs", rhs = c("WineQuality=7", "WineQuality=8", "WineQuality=9")), control = list(verbose = F))
wineRulesGood <- sort(wineRulesGood, decreasing = TRUE, by = "confidence")
arules::inspect(wineRulesGood[1:5])
##     lhs                                rhs                 support confidence    coverage      lift count
## [1] {VolatileAcidity=[0.08, 0.18),                                                                       
##      FreeSO2=[31, 36),                                                                                   
##      PercentAlc=[8, 9.2)}           => {WineQuality=7} 0.002041650          1 0.002041650  5.565909    10
## [2] {VolatileAcidity=[0.08, 0.18),                                                                       
##      CitricAcid=[0.27, 0.3),                                                                             
##      pH=[2.72, 3.03)}               => {WineQuality=7} 0.003470804          1 0.003470804  5.565909    17
## [3] {VolatileAcidity=[0.18, 0.22),                                                                       
##      TotalSO2=[143, 162),                                                                                
##      pH=[2.72, 3.03),                                                                                    
##      PercentAlc=[8, 9.2)}           => {WineQuality=8} 0.001837485          1 0.001837485 27.988571     9
## [4] {VolatileAcidity=[0.18, 0.22),                                                                       
##      CitricAcid=[0.27, 0.3),                                                                             
##      pH=[2.72, 3.03),                                                                                    
##      PercentAlc=[8, 9.2)}           => {WineQuality=8} 0.001837485          1 0.001837485 27.988571     9
## [5] {VolatileAcidity=[0.18, 0.22),                                                                       
##      CitricAcid=[0.27, 0.3),                                                                             
##      FreeSO2=[44, 53),                                                                                   
##      PercentAlc=[8, 9.2)}           => {WineQuality=8} 0.001837485          1 0.001837485 27.988571     9

The top five rules for the good wines exhibit strong similarities. All of them have 100% confidence, so every time one of these left hand sides appears, its right hand side appears with it. Each rule associates a set of property items with a wine quality rating. The volatile acidity variable shows up in all five rules, always as one of its two lowest bins. The citric acid variable appears in three rules, always as the same middle bin, [0.27, 0.3). The pH variable is in three of the rules, always as its lowest bin, and the percent alcohol variable shows up in four of the rules, also as its lowest bin. These rules can be summarized with the following general association: low volatile acidity, low pH, low percent alcohol, and a medium level of citric acid tend to result in highly rated wines, receiving a score of 7 or 8.

The second rule set generated will be for the “average” wines, which are those rated a 5 or a 6.

#Identifying Top 5 Rules to Identify Average Wine by Confidence
wineRulesAvg <- apriori(data = wineTransaction, parameter = list(supp=0.003, conf = 0.9),
                        appearance = list(default="lhs", rhs=c("WineQuality=5", "WineQuality=6")),
                        control = list(verbose = F))
summary(wineRulesAvg)
## set of 47 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
##  1 36  9  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.213   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.003062   Min.   :0.9000   Min.   :0.003062   Min.   :2.006  
##  1st Qu.:0.003062   1st Qu.:0.9375   1st Qu.:0.003267   1st Qu.:2.097  
##  Median :0.003267   Median :0.9412   Median :0.003471   Median :2.228  
##  Mean   :0.003627   Mean   :0.9509   Mean   :0.003827   Mean   :2.552  
##  3rd Qu.:0.003777   3rd Qu.:1.0000   3rd Qu.:0.004185   3rd Qu.:3.152  
##  Max.   :0.006125   Max.   :1.0000   Max.   :0.006737   Max.   :3.362  
##      count      
##  Min.   :15.00  
##  1st Qu.:15.00  
##  Median :16.00  
##  Mean   :17.77  
##  3rd Qu.:18.50  
##  Max.   :30.00  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898   0.003        0.9
##                                                                                                                                                                                     call
##  apriori(data = wineTransaction, parameter = list(supp = 0.003, conf = 0.9), appearance = list(default = "lhs", rhs = c("WineQuality=5", "WineQuality=6")), control = list(verbose = F))
wineRulesAvg <- sort(wineRulesAvg, decreasing = TRUE, by = "confidence")
arules::inspect(wineRulesAvg[1:5])
##     lhs                                rhs                 support confidence    coverage     lift count
## [1] {FixedAcidity=[6.7, 6.9),                                                                           
##      ResidSugar=[1.9, 4.1),                                                                             
##      pH=[3.1, 3.15)}                => {WineQuality=6} 0.003062474          1 0.003062474 2.228389    15
## [2] {FixedAcidity=[6.7, 6.9),                                                                           
##      VolatileAcidity=[0.31, 0.37),                                                                      
##      Chlorides=[0.049, 0.055)}      => {WineQuality=5} 0.003062474          1 0.003062474 3.361702    15
## [3] {VolatileAcidity=[0.31, 0.37),                                                                      
##      pH=[2.72, 3.03),                                                                                   
##      PercentAlc=[9.2, 9.5)}         => {WineQuality=5} 0.003470804          1 0.003470804 3.361702    17
## [4] {FixedAcidity=[6.9, 7.2),                                                                           
##      FreeSO2=[2, 18),                                                                                   
##      PercentAlc=[9.2, 9.5)}         => {WineQuality=5} 0.003062474          1 0.003062474 3.361702    15
## [5] {FreeSO2=[36, 44),                                                                                  
##      pH=[3.2, 3.26),                                                                                    
##      Sulfates=[0.54, 0.6)}          => {WineQuality=6} 0.003062474          1 0.003062474 2.228389    15

The top five rules for the average wines are much more varied than they were for the good wines. Each of these rules also has 100% confidence. Fixed acidity shows up in three of the rules, always as one of its middle bins, [6.7, 6.9) or [6.9, 7.2). Percent alcohol is present in two of the rules, both times as its second lowest bin. Beyond that, the left hand side items vary in which variable appears and which bin it falls into, which makes it harder to summarize the properties that lead to an average-quality wine. A possible explanation is that an average-rated wine has some good qualities and some bad qualities, and which ones are good or bad differs from wine to wine. Those qualities then balance out to produce an average-tasting wine.

The final rule set will be for the “poor” wines, those rated a 4 or lower.

#Identifying Top 5 rules to Identify Poorly Rated Wine by Confidence
wineRulesPoor <- apriori(data = wineTransaction, parameter = list(supp=0.0008, conf = 0.80),
                         appearance = list(default="lhs", rhs=c("WineQuality=3", "WineQuality=4")),
                         control = list(verbose = F))
summary(wineRulesPoor)
## set of 32 rules
## 
## rule length distribution (lhs + rhs):sizes
##  5  6  7 
## 21 10  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   5.000   5.000   5.375   6.000   7.000 
## 
## summary of quality measures:
##     support            confidence        coverage              lift      
##  Min.   :0.0008167   Min.   :0.8000   Min.   :0.0008167   Min.   :24.04  
##  1st Qu.:0.0008167   1st Qu.:0.8000   1st Qu.:0.0008167   1st Qu.:24.04  
##  Median :0.0008167   Median :0.8167   Median :0.0010208   Median :24.54  
##  Mean   :0.0008549   Mean   :0.8799   Mean   :0.0009825   Mean   :26.44  
##  3rd Qu.:0.0008167   3rd Qu.:1.0000   3rd Qu.:0.0010208   3rd Qu.:30.05  
##  Max.   :0.0012250   Max.   :1.0000   Max.   :0.0014292   Max.   :30.05  
##      count      
##  Min.   :4.000  
##  1st Qu.:4.000  
##  Median :4.000  
##  Mean   :4.188  
##  3rd Qu.:4.000  
##  Max.   :6.000  
## 
## mining info:
##             data ntransactions support confidence
##  wineTransaction          4898   8e-04        0.8
##                                                                                                                                                                                     call
##  apriori(data = wineTransaction, parameter = list(supp = 8e-04, conf = 0.8), appearance = list(default = "lhs", rhs = c("WineQuality=3", "WineQuality=4")), control = list(verbose = F))
wineRulesPoor <- sort(wineRulesPoor, decreasing = TRUE, by = "confidence")
arules::inspect(wineRulesPoor[1:5])
##     lhs                               rhs                  support confidence     coverage     lift count
## [1] {FixedAcidity=[6.4, 6.7),                                                                            
##      ResidSugar=[4.1, 6.6),                                                                              
##      FreeSO2=[2, 18),                                                                                    
##      PercentAlc=[8, 9.2)}          => {WineQuality=4} 0.0008166599          1 0.0008166599 30.04908     4
## [2] {VolatileAcidity=[0.37, 1.1],                                                                        
##      CitricAcid=[0, 0.24),                                                                               
##      Chlorides=[0.055, 0.346],                                                                           
##      Sulfates=[0.22, 0.38)}        => {WineQuality=4} 0.0008166599          1 0.0008166599 30.04908     4
## [3] {VolatileAcidity=[0.37, 1.1],                                                                        
##      CitricAcid=[0, 0.24),                                                                               
##      ResidSugar=[0.6, 1.4),                                                                              
##      Sulfates=[0.38, 0.42)}        => {WineQuality=4} 0.0010208248          1 0.0010208248 30.04908     5
## [4] {CitricAcid=[0, 0.24),                                                                               
##      FreeSO2=[2, 18),                                                                                    
##      pH=[3.35, 3.82],                                                                                    
##      PercentAlc=[11.2, 12.1)}      => {WineQuality=4} 0.0008166599          1 0.0008166599 30.04908     4
## [5] {FixedAcidity=[6, 6.4),                                                                              
##      CitricAcid=[0, 0.24),                                                                               
##      Chlorides=[0.032, 0.037),                                                                           
##      FreeSO2=[2, 18)}              => {WineQuality=4} 0.0008166599          1 0.0008166599 30.04908     4

The top five rules for the poor wines are somewhat varied, but less so than the average wines. Again, all five of these rules have 100% confidence. The citric acid variable appears in four of the rules, with the item representing the lowest bin each time. The free SO2 variable appears three times, also representing its lowest bin each time. It is also noteworthy that one rule contains the lowest percent alcohol bin while another contains the second highest. The other items on the left-hand side tend to vary from rule to rule; for example, the residual sugar in the first rule is much higher than in the third rule. Overall, the lowest levels of citric acid and free SO2 tend to result in poorly rated wines. Since the amount of free SO2 determines how resistant a wine is to spoiling, it is possible that these wines were rated poorly because their quality had degraded from what it was originally. That said, there may also be many other combinations of factors that explain why a wine was rated poorly.

Clustering by K-Means

A clustering analysis is the process of finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups. Since this is an unsupervised machine learning technique, there are no predefined classes.

The K-Means clustering algorithm involves finding centroids, where a centroid is the center of a cluster determined by averaging all the data examples in that cluster. The algorithm starts by randomly picking k points as the initial centroids and forming k clusters based on how close each point is to each centroid. The centroid of each cluster is then recalculated and the points are reassigned. This process repeats until the centroids do not change from one step to the next.
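To make the assign-and-recompute loop concrete, here is a minimal sketch of the algorithm in R. The function name simple_kmeans and its arguments are made up for illustration only; the analysis below uses the built-in kmeans() function.

```r
#Minimal k-means sketch; simple_kmeans is a made-up name for illustration only.
simple_kmeans <- function(X, k, max_iter = 100) {
  #Randomly pick k data points as the starting centroids.
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    #Assign each point to its nearest centroid (squared Euclidean distance).
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
    assignment <- max.col(-d)
    #Recompute each centroid as the average of its assigned points
    #(an empty cluster keeps its old centroid).
    new_centroids <- t(sapply(seq_len(k), function(j) {
      members <- X[assignment == j, , drop = FALSE]
      if (nrow(members) == 0) centroids[j, ] else colMeans(members)
    }))
    #Stop once the centroids no longer change from one step to the next.
    if (all(new_centroids == centroids)) break
    centroids <- new_centroids
  }
  list(cluster = assignment, centers = centroids)
}
```

With well-separated data this converges in a handful of iterations; stats::kmeans() additionally offers multiple random starts and uses the more refined Hartigan-Wong algorithm by default.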

The quality of the models created using this method can be assessed by looking at the ratio of the intra-cluster variance to the inter-cluster variance. Intra-cluster variance measures the distance between the centroid and each point of the cluster and is represented by the within sum of squares value. Inter-cluster variance measures the external separation of the clusters and is represented by the between sum of squares value. Better clustering occurs when the within sum of squares is minimized and the between sum of squares is maximized.

(reference: https://towardsdatascience.com/explain-ml-in-a-simple-way-k-means-clustering-e925d019743b)
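As a quick illustration on made-up data (not the wine data set), the kmeans() result object exposes these sum of squares quantities directly, and the total sum of squares always decomposes into the between and within parts:

```r
#Toy illustration on made-up data: kmeans() reports the sum of squares pieces,
#and totss always equals betweenss + tot.withinss.
set.seed(1)
toy <- cbind(x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
             y = c(rnorm(20, mean = 0), rnorm(20, mean = 5)))
km <- kmeans(toy, centers = 2)
km$betweenss / km$totss                              #share of the total that is between clusters
all.equal(km$totss, km$betweenss + km$tot.withinss)  #TRUE
```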

Before starting, it is important to point out a weakness of the K-Means algorithm. As the number of clusters is increased, the sum of squares error is consequently decreased. This means that if the number of clusters is the same as the number of data points then the sum of squares measures will signify a perfect clustering, even though this would tell us nothing new about the data set.
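This weakness can be demonstrated on a small made-up data set, where the within sum of squares shrinks as k grows and reaches zero once every point is its own cluster:

```r
#Toy demonstration on made-up data: tot.withinss falls as k grows and hits
#zero once every point is its own cluster.
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)   #10 two-dimensional points
wss <- sapply(c(1, 2, 5, 10), function(k) kmeans(pts, k)$tot.withinss)
wss                                  #last value is 0: a "perfect" but useless clustering
```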

Several models will be created, each targeting a different number of clusters. The first model will feature 3 clusters, similar to how the wine quality ratings were broken into 3 groups (poor, average, and good) for the association rule mining analysis. It will be interesting to see if the K-Means clustering will be able to recognize and correctly classify these 3 groups.

#Set seed to control randomization
set.seed(12345)

#Run k-means to create a model with 3 clusters
KM3 <- kmeans(wineCluster, 3)
#Show the percentage of the total sum of squares that is the between sum of squares
KM3$betweenss / (KM3$tot.withinss + KM3$betweenss)
## [1] 0.2674026

The total sum of squares is the sum of the between sum of squares and within sum of squares. In this model featuring three clusters, the between sum of squares is only 26.7% of the total sum of squares, meaning the within sum of squares is 73.3% of the total sum of squares. Since the between sum of squares is a small percent of the total, there is little external separation between the clusters. This can be seen in the following visualization.

There are two massive, fairly distinct clusters of data points, and a third cluster that is totally encompassed by the other two.

#Adding the cluster assignments to the data frame as a factor variable.
wineKM3 <- wineCluster
wineKM3$Clusters <- as.factor(KM3$cluster)

#Visualize the clusters
clusplot(wineKM3, wineKM3$Clusters, color = TRUE, 
         shade = TRUE, labels = 0, lines = 0)

#Adding the cluster column to the version of the data frame where the wine quality rating is included. This way the clusters can be shown by wine quality to evaluate the model.
wineKM3Bar <- wine
wineKM3Bar$Clusters <- as.factor(KM3$cluster)
wineKM3Bar %>%
  ggplot(aes(x = WineQuality, fill = Clusters)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15)) 

The 3 cluster model was not able to create “poor”, “average”, and “good” quality wine groups. The first and second clusters each contain wine samples from every wine quality rating, while the third cluster is split between wine samples rated a 5 or a 6. Since the samples of each quality rating are spread among all the clusters, this model did not identify patterns in the physicochemical properties that could lead to distinct groupings based on quality ratings.

The next model will feature seven clusters since there are seven different quality ratings.

#Set seed to control randomization
set.seed(12345)

#Run k-means to create a model with 7 clusters
KM7 <- kmeans(wineCluster, 7)
#Show the percentage of the total sum of squares that is the between sum of squares
KM7$betweenss / (KM7$tot.withinss + KM7$betweenss)
## [1] 0.4269024

This time the between sum of squares is 42.7% of the total sum of squares, meaning there is more separation between the clusters in this model than in the previous one. However, this metric naturally increases as the number of clusters increases, regardless of how well the model is performing, so on its own it gives an inflated sense of how well the model performs.

This time, the data examples are forming a large cloud without too much of a distinct shape, and all of the clusters are overlapping with one another to varying degrees.

#Adding the cluster assignments to the data frame as a factor variable.
wineKM7 <- wineCluster
wineKM7$Clusters <- as.factor(KM7$cluster)

#Visualize the clusters
clusplot(wineKM7, wineKM7$Clusters, color = TRUE, 
         shade = TRUE, labels = 0, lines = 0)

#Adding the cluster column to the version of the data frame where the wine quality rating is included. This way the clusters can be shown by wine quality to evaluate the model.
wineKM7Bar <- wine
wineKM7Bar$Clusters <- as.factor(KM7$cluster)
wineKM7Bar %>%
  ggplot(aes(x = WineQuality, fill = Clusters)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15)) 

Increasing the number of clusters to seven did not improve the model’s ability to group the wine samples to match the wine quality ratings. Once again, many of the clusters contain samples from every wine quality rating. This model does not allow any strong conclusions to be made.

The next model will feature ten clusters.

#Set seed to control randomization
set.seed(12345)

#Run k-means to create a model with 10 clusters
KM10 <- kmeans(wineCluster, 10)
#Show the percentage of the total sum of squares that is the between sum of squares
KM10$betweenss / (KM10$tot.withinss + KM10$betweenss)
## [1] 0.4901618

This time, the between sum of squares has increased to 49.0%, roughly a 6 percentage point increase over the previous model. Even with three more clusters, there is not a great deal of additional separation.

The data examples are forming a single large cloud with all of the clusters intersecting into the cloud.

#Adding the cluster assignments to the data frame as a factor variable.
wineKM10 <- wineCluster
wineKM10$Clusters <- as.factor(KM10$cluster)

#Visualize the clusters
clusplot(wineKM10, wineKM10$Clusters, color = TRUE, 
         shade = TRUE, labels = 0, lines = 0)

#Adding the cluster column to the version of the data frame where the wine quality rating is included. This way the clusters can be shown by wine quality to evaluate the model.
wineKM10Bar <- wine
wineKM10Bar$Clusters <- as.factor(KM10$cluster)
wineKM10Bar %>%
  ggplot(aes(x = WineQuality, fill = Clusters)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15)) 

The ten cluster model has very similar results compared to the last two models. Again, the clusters are not being formed in a way where each cluster would represent a particular wine quality rating. Each wine quality rating is featured in almost all of the ten clusters.

The last k-Means clustering model will feature 14 clusters.

#Set seed to control randomization
set.seed(12345)

#Run k-means to create a model with 14 clusters
KM14 <- kmeans(wineCluster, 14)
#Show the percentage of the total sum of squares that is the between sum of squares
KM14$betweenss / (KM14$tot.withinss + KM14$betweenss)
## [1] 0.5311522

This 14 cluster model features only a very modest increase in the between sum of squares percentage compared to the last model. This means similar results to the previous model should be expected.

Once again, the data examples form a large cloud and all of the clusters intersect into that cloud just like with the seven and ten cluster models.

#Adding the cluster assignments to the data frame as a factor variable.
wineKM14 <- wineCluster
wineKM14$Clusters <- as.factor(KM14$cluster)

#Visualize the clusters
clusplot(wineKM14, wineKM14$Clusters, color = TRUE, 
         shade = TRUE, labels = 0, lines = 0)

#Adding the cluster column to the version of the data frame where the wine quality rating is included. This way the clusters can be shown by wine quality to evaluate the model.
wineKM14Bar <- wine
wineKM14Bar$Clusters <- as.factor(KM14$cluster)
wineKM14Bar %>%
  ggplot(aes(x = WineQuality, fill = Clusters)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15)) 

Using fourteen clusters only further fragmented the wine samples from each quality rating category across separate clusters. Just like with the previous models, wine samples from each wine quality rating show up in many, if not all, of the clusters.

K-Means Clustering Model Comparison and Selection

None of the k-Means clustering models created clusters that appear related to each wine sample’s quality rating. This may mean that k-Means clustering is not a suitable technique for this particular data set. It may also mean that the differences between the physicochemical properties that lead to a certain quality rating are too small for the algorithm to pick up on.

For the purposes of this analysis, the seven cluster model will be chosen as the final model. Having more clusters did not improve the model and only made the visualization harder to understand. None of these models categorized the data in the way that was hoped for, so which model is selected does not matter too much. The results of these models are still meaningful, however, just not towards a conclusion of the quality of the wine samples being distinctly separated by the properties contained in this data set.

Hierarchical Agglomerative Clustering (HAC)

The Hierarchical Agglomerative Clustering algorithm creates a dendrogram, a tree diagram of clusters made up of several levels of nested partitioning. The clusters are determined using a selected distance measure. The advantage of HAC with a dendrogram output is that it reveals exactly which data example is in each cluster; also, within each cluster, the data examples that are closest to one another are the most similar. This works best, however, when the data set is small enough that all the results are visible on a single plot. Since this data set contains almost 5,000 data examples, a full dendrogram would be completely unreadable. To overcome this limitation, the data will be clustered using HAC and a colored dendrogram will be displayed. This colored dendrogram will then be converted into a bar plot like those used for the k-Means clustering analysis.

The results of the HAC models will differ based on which distance measure is used. Four different models will be created, each featuring a different distance measure. These models will be assessed in the same way as the k-Means models, by seeing if the models are able to recreate the wine quality rating categories when they cluster the wine samples.

The four distance calculation methods that will be used in this analysis are Euclidean, Cosine, Manhattan, and Minkowski. Euclidean distance is the “straight-line” physical distance between the data entries. Cosine distance measures the size of the angle between the two data points as measured from the origin. Manhattan distance measures the distance from one data point to another, but only by traveling along square blocks. Minkowski distance is a generalization of these measures controlled by a parameter p: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. A value of p = 1.5 will be used because the resulting calculation falls between the Euclidean and Manhattan methods.
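As a quick illustration with two made-up points, the first three measures are available through the dist() function, while the cosine distance can be computed by hand (base R’s dist() does not offer a cosine method):

```r
#Two made-up three-dimensional points for illustration.
a <- c(1, 2, 3)
b <- c(4, 0, 8)
dist(rbind(a, b), method = "euclidean")            #sqrt(sum((a - b)^2)), about 6.16
dist(rbind(a, b), method = "manhattan")            #sum(abs(a - b)) = 10
dist(rbind(a, b), method = "minkowski", p = 1.5)   #falls between the two values above
1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))         #cosine distance, computed by hand
```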

set.seed(12345)
#Calculate each of the distance measures to be used for HAC.
distEuc <- dist(wineCluster, method = "euclidean")
distCos <- dist(wineCluster, method = "cosine")   #not in base stats::dist(); relies on a package such as proxy masking dist()
distMan <- dist(wineCluster, method = "manhattan")
distMin <- dist(wineCluster, method = "minkowski", p = 1.5)

Each clustering will be performed using the complete linkage method. This method calculates the distance between clusters by finding the distance between the two data points in each cluster that are the farthest away from one another. This method produces clusters that are more compact than other linkage methods. This is advantageous for this task because the previous analysis has shown that many of the clusters are similar. By favoring smaller clusters over the larger clusters, this model may be able to uncover some patterns in the data that lead to particular wine quality ratings.

(reference: https://www.r-bloggers.com/2017/12/how-to-perform-hierarchical-clustering-using-r/)
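As a toy illustration with two made-up one-dimensional clusters, the complete-linkage distance is simply the largest pairwise distance between their members:

```r
#Two made-up one-dimensional clusters for illustration.
clusterA <- c(1, 2)
clusterB <- c(5, 9)
#Complete linkage: the largest pairwise distance between members, here 9 - 1 = 8.
max(abs(outer(clusterA, clusterB, "-")))
```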

The first model will use the Euclidean distance measure.

set.seed(12345)
#Run the first HAC using Euclidean distance and complete linkage.
HAC1 <- hclust(distEuc, method = "complete")

The model will be visualized using a colored dendrogram. Although the labels at the end of the branches are unreadable due to the volume of data, this visualization gives an initial impression of the number and size of the clusters created by the model. Setting the h parameter to 12 tells the model at which height to cut the tree for the purpose of creating and coloring the clusters. An h value of 12 will allow the model to create a similar number of clusters as the k-Means models if it finds it appropriate to do so.

#Plot the colored dendrogram based on the HAC.
dendrogram1 <- as.dendrogram(HAC1)
dendrogram_color1 <- color_branches(dendrogram1, h=12)
plot(dendrogram_color1)

The dendrogram features five clusters each represented by a different color. The blue and pink clusters contain the great majority of the wine samples, and there appears to be a cluster of a single wine sample that is colored red.

The cutree function is used to cut the trees at the same height as was done in the color dendrogram. These cut levels form the clusters and that information is passed to a data frame. This data frame also has the wine quality rating information added back into it. The clusters and wine quality ratings can now be displayed in the same style of bar chart that was used in the k-Means analysis.

#Using cutree to cut the tree at the same heights as the color dendrogram.
cuts <- cutree(HAC1, h = 12)
#Copying wineCluster to a new results data frame.
wineClusterResult <- wineCluster
#Adding the wine quality rating back into the data set.
wineClusterResult$WineQuality <- wine$WineQuality
#Adding the cluster assignments to the data set based on the level at which the trees were cut.
wineClusterResult <- mutate(wineClusterResult, cluster = cuts)
wineClusterResult$cluster <- as.factor(wineClusterResult$cluster)
#Plotting the bar plot of the cluster assignments by wine quality.
wineClusterResult %>%
  ggplot(aes(x = WineQuality, fill = cluster)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15))

The bar plot reveals that the model placed wine samples into five of the available eight clusters. The wine samples by quality rating are distributed throughout the clusters. There is no cluster that represents a single quality rating or even a group of similar quality ratings. The Euclidean distance metric did not achieve the clustering results that were hoped for.

The next model will use the cosine distance measure.

set.seed(12345)
#Run the second HAC using cosine distance and complete linkage.
HAC2 <- hclust(distCos, method = "complete")
#Plot the colored dendrogram based on the HAC.
dendrogram2 <- as.dendrogram(HAC2)
dendrogram_color2 <- color_branches(dendrogram2, h=4)
plot(dendrogram_color2)

Even with the h value reduced to 4, which gives the model the opportunity to form more, smaller clusters, the model created only a single cluster containing every wine sample. The model provides no insight into the data set.

#Using cutree to cut the tree at the same heights as the color dendrogram.
cuts2 <- cutree(HAC2, h = 4)
#Copying wineCluster to a new results data frame.
wineClusterResult2 <- wineCluster
#Adding the wine quality rating back into the data set.
wineClusterResult2$WineQuality <- wine$WineQuality
#Adding the cluster assignments to the data set based on the level at which the trees were cut.
wineClusterResult2 <- mutate(wineClusterResult2, cluster = cuts2)
wineClusterResult2$cluster <- as.factor(wineClusterResult2$cluster)
#Plotting the bar plot of the cluster assignments by wine quality.
wineClusterResult2 %>%
  ggplot(aes(x = WineQuality, fill = cluster)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15))

Unsurprisingly, the associated bar chart shows that all the wine samples were placed into a single cluster. It is clear that the cosine distance measure is unable to find any significant differences in the data examples that could result in them being grouped differently.

The third model will use the Manhattan distance measure.

set.seed(12345)
#Run the third HAC using Manhattan distance and complete linkage.
HAC3 <- hclust(distMan, method = "complete")

The model will be visualized using a colored dendrogram. Setting the h parameter to 24 results in a lower number of clusters so it is more comparable to the Euclidean distance model. This will be seen in the size of the clusters in the color dendrogram. Setting the h parameter too low results in several hundred potential clusters.

#Plot the colored dendrogram based on the HAC.
dendrogram3 <- as.dendrogram(HAC3)
dendrogram_color3 <- color_branches(dendrogram3, h=24)
plot(dendrogram_color3)

This dendrogram features nine clusters represented by the different colors. The wine samples are more evenly distributed throughout the clusters than they were in the Euclidean distance model. Again, there is a single-sample cluster on the left side of the plot. The large blue and pink clusters from the Euclidean model have each been subdivided into a few more clusters.

#Using cutree to cut the tree at the same heights as the color dendrogram.
cuts3 <- cutree(HAC3, h = 24)
#Copying wineCluster to a new results data frame.
wineClusterResult3 <- wineCluster
#Adding the wine quality rating back into the data set.
wineClusterResult3$WineQuality <- wine$WineQuality
#Adding the cluster assignments to the data set based on the level at which the trees were cut.
wineClusterResult3 <- mutate(wineClusterResult3, cluster = cuts3)
wineClusterResult3$cluster <- as.factor(wineClusterResult3$cluster)
#Plotting the bar plot of the cluster assignments by wine quality.
wineClusterResult3 %>%
  ggplot(aes(x = WineQuality, fill = cluster)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15))

The associated bar plot reveals that the wine samples with different wine quality ratings are once again spread throughout all of the clusters. The clusters are not related in any way to the wine quality ratings, as each bar has similar color bands running through it.

The last model will use the Minkowski distance measure.

set.seed(12345)
#Run the fourth HAC using Minkowski distance and complete linkage.
HAC4 <- hclust(distMin, method = "complete")

The Minkowski model uses an h value of 16 to keep the number of clusters consistent with the other models.

#Plot the colored dendrogram based on the HAC.
dendrogram4 <- as.dendrogram(HAC4)
dendrogram_color4 <- color_branches(dendrogram4, h=16)
plot(dendrogram_color4)

The resulting dendrogram has a very similar make-up to that of the Euclidean model. This time, however, the purple cluster is the largest, and the blue and pink clusters have been reduced in size. Once again, there is a single wine sample cluster in red. Also, a small yellow cluster has appeared to the left of the green cluster.

#Using cutree to cut the tree at the same heights as the color dendrogram.
cuts4 <- cutree(HAC4, h = 16)
#Copying wineCluster to a new results data frame.
wineClusterResult4 <- wineCluster
#Adding the wine quality rating back into the data set.
wineClusterResult4$WineQuality <- wine$WineQuality
#Adding the cluster assignments to the data set based on the level at which the trees were cut.
wineClusterResult4 <- mutate(wineClusterResult4, cluster = cuts4)
wineClusterResult4$cluster <- as.factor(wineClusterResult4$cluster)
#Plotting the bar plot of the cluster assignments by wine quality.
wineClusterResult4 %>%
  ggplot(aes(x = WineQuality, fill = cluster)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15))

The bar plot tells a story that is very similar to the rest of the HAC models as well as the K-Means models. The HAC algorithm is not able to create clusters that reflect the different wine quality ratings. Again, the samples are distributed evenly throughout the different clusters. The color bands in each bar are very similar to one another, which means that the characteristics of the wine samples do not separate them into different quality rating groups when clustered.

HAC Model Comparison and Selection

Like the K-Means models, the HAC models were not able to cluster the data according to the output variable. The cosine distance model performed especially poorly, just placing the entire data set into a single cluster. The rest of the distance measures gave similar results. When compared specifically to the wine quality ratings, the clusters appear to just be assigned at random. All of the clusters contain a variety of the wine samples with different quality ratings. This clustering technique does not perform well for this specific task related to the data set.

Although none of the models provide insight towards solving this clustering problem, the Minkowski model will be selected as the final model. Both the dendrogram and bar plot are very clear in showing the model’s inability to identify the different wine quality rating groups. It is similar to the Euclidean and Manhattan models, while offering a significant advantage over the cosine model which did not provide any useful insights whatsoever.

Classification by Decision Tree Models

A decision tree is made up of a series of nodes and splits, each focusing on a different variable in the data set. Based on the values of those variables, a series of “yes-no” decisions is made that leads to a prediction. In this case, the model will predict the quality rating for each wine sample based on the values of its physicochemical properties. Starting from the first node, the model chooses a variable and a boundary value. Depending on how the actual value of that variable for each wine sample compares to the boundary value, each node branches off in two directions, leading either to a terminal node, where a final prediction is made, or to a node that features another split based on the boundary value of a variable. This type of classification model is easy to read and interpret because all the decisions and predictions are plainly displayed for the reader. By following each branch of the tree from top to bottom, one can know exactly why the model predicted each output for each wine sample.

For this analysis, several decision tree models will be run. The parameters will be tuned to create trees that are pruned to various levels or completely unpruned. The models will be compared in terms of accuracy, efficiency, and readability. Since the main advantage of using a decision tree model is how it is possible to see every decision the algorithm makes to come to a prediction, this aspect will be weighed heavily when choosing the best decision tree model.

The first decision tree will be created using the default parameters.

set.seed(12345)
DT_model1 <- rpart(WineQuality ~ ., data = wineTrain, method = "class")

The following output reveals that the default decision tree features 5 splits. Also, the relative error remains high, at about 0.83 (with a cross-validated error of about 0.86). This means that the decision tree is not accurate when predicting the wine quality rating.

rsq.rpart(DT_model1)
## 
## Classification tree:
## rpart(formula = WineQuality ~ ., data = wineTrain, method = "class")
## 
## Variables actually used in tree construction:
## [1] FreeSO2         PercentAlc      VolatileAcidity
## 
## Root node error: 1892/3431 = 0.55144
## 
## n= 3431 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.051004      0   1.00000 1.00000 0.015397
## 2 0.024313      2   0.89799 0.90592 0.015480
## 3 0.020613      3   0.87368 0.89746 0.015479
## 4 0.019027      4   0.85307 0.88689 0.015476
## 5 0.010000      5   0.83404 0.85835 0.015458

The following plot is the actual decision tree used by the model to predict the wine quality ratings. The decision tree works by having each data example answer a series of yes or no questions based on the values of its variables. In this model, if the percent alcohol of a wine sample is less than 11, the model splits to the left and then asks whether the volatile acidity is greater than or equal to 0.24, and so on. If the percent alcohol is greater than or equal to 11, the model splits to the right and asks whether the percent alcohol is less than 13, and so on. This decision tree relies entirely on only three variables: PercentAlc, VolatileAcidity, and FreeSO2. Also, because there are only six terminal nodes, there are not enough nodes to represent every wine quality rating. Based on the terminal nodes, this model will only ever predict that a wine is a 5, 6, or 7, so there is no way it will ever classify wine samples with ratings of 3, 4, 8, or 9 correctly.

fancyRpartPlot(DT_model1)

preds_DT1 <- predict(DT_model1, newdata = wineTest, type = "class")
confusionMatrix(preds_DT1, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   1  21 194  99  12   0   0
##          6   4  26 241 518 200  36   0
##          7   1   1   2  42  52  16   1
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5208          
##                  95% CI : (0.4949, 0.5466)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 2.27e-08        
##                                           
##                   Kappa : 0.2087          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.4439   0.7860  0.19697  0.00000
## Specificity           1.00000  1.00000   0.8709   0.3725  0.94763  1.00000
## Pos Pred Value            NaN      NaN   0.5933   0.5054  0.45217      NaN
## Neg Pred Value        0.99591  0.96728   0.7868   0.6810  0.84320  0.96455
## Prevalence            0.00409  0.03272   0.2979   0.4492  0.17996  0.03545
## Detection Rate        0.00000  0.00000   0.1322   0.3531  0.03545  0.00000
## Detection Prevalence  0.00000  0.00000   0.2229   0.6987  0.07839  0.00000
## Balanced Accuracy     0.50000  0.50000   0.6574   0.5793  0.57230  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Based on the limitations of the decision tree, it is unsurprising that this model has low accuracy: only 52.08%. However, the model is still significantly better than the No Information Rate (NIR) of 44.92%. The NIR is the accuracy of a baseline that classifies every data example as the most prevalent class without looking at the input data; in this case, that baseline predicts that every wine has a quality rating of 6. Examining the confusion matrix reveals that the model makes many mistakes even when classifying wines that do have a rating of 5, 6, or 7. When a wine has a rating of 5, the model guessed 5 and 6 an almost equal number of times. Of the quality ratings the model can actually output, it had the worst accuracy on wines rated 7, misclassifying 212 of the 264 samples.
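The Accuracy and No Information Rate reported by confusionMatrix can be reproduced directly from the counts in the matrix above (a quick check, sketched in Python):

```python
# Confusion matrix from the default tree (rows = predicted, cols = actual),
# for classes 3..9; rows for classes the tree never predicts are all zero.
cm = [
    [0,  0,   0,   0,   0,  0, 0],   # predicted 3
    [0,  0,   0,   0,   0,  0, 0],   # predicted 4
    [1, 21, 194,  99,  12,  0, 0],   # predicted 5
    [4, 26, 241, 518, 200, 36, 0],   # predicted 6
    [1,  1,   2,  42,  52, 16, 1],   # predicted 7
    [0,  0,   0,   0,   0,  0, 0],   # predicted 8
    [0,  0,   0,   0,   0,  0, 0],   # predicted 9
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal = correct guesses
accuracy = correct / total

# No Information Rate: accuracy of always guessing the largest actual class.
col_totals = [sum(row[j] for row in cm) for j in range(len(cm))]
nir = max(col_totals) / total

print(round(accuracy, 4))  # 0.5208
print(round(nir, 4))       # 0.4492
```

The largest column total (659 wines actually rated 6, out of 1467 test samples) is what drives the 44.92% NIR.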

The next model will be pruned less to see if the accuracy can be improved. Ideally, this model would feature enough terminal nodes so that every wine quality rating can be guessed.

set.seed(12345)
DT_model2 <- rpart(WineQuality ~ ., data = wineTrain, method = "class", 
                   control = rpart.control(cp = 0.003), model = TRUE)

Setting the complexity parameter to 0.003 will result in a tree with more splits and nodes than the default model, which uses cp = 0.01. A lower complexity parameter makes the model less restrictive about creating splits: a split only needs to reduce the tree's relative error by 0.003, rather than 0.01, to be kept. This results in a decision tree with 14 splits. The plots reveal that the accuracy of this model did not meaningfully improve over the default model.
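The effect of cp can be illustrated with a toy version of the rule (a deliberate simplification of rpart's actual criterion, which scales improvements by the root node error): a candidate split survives only if it improves the relative error by at least cp.

```python
# Toy version of the cp pruning rule (simplified from rpart): keep a split
# only if it reduces the tree's relative error by at least cp.
def keep_split(rel_error_before, rel_error_after, cp):
    return (rel_error_before - rel_error_after) >= cp

improvement_case = (0.840, 0.835)  # hypothetical candidate split

print(keep_split(*improvement_case, cp=0.01))   # False: rejected at the default cp
print(keep_split(*improvement_case, cp=0.003))  # True: kept at cp = 0.003
```

Lowering cp therefore admits marginal splits that the default model would have pruned away, which is why the tree grows from 5 splits to 14.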

rsq.rpart(DT_model2)
## 
## Classification tree:
## rpart(formula = WineQuality ~ ., data = wineTrain, method = "class", 
##     model = TRUE, control = rpart.control(cp = 0.003))
## 
## Variables actually used in tree construction:
## [1] Chlorides       FreeSO2         PercentAlc      Sulfates       
## [5] TotalSO2        VolatileAcidity
## 
## Root node error: 1892/3431 = 0.55144
## 
## n= 3431 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0510042      0   1.00000 1.00000 0.015397
## 2  0.0243129      2   0.89799 0.90592 0.015480
## 3  0.0206131      3   0.87368 0.89746 0.015479
## 4  0.0190275      4   0.85307 0.88689 0.015476
## 5  0.0079281      5   0.83404 0.85729 0.015457
## 6  0.0058140      7   0.81818 0.84989 0.015449
## 7  0.0044926      8   0.81237 0.83879 0.015436
## 8  0.0042283     10   0.80338 0.84302 0.015441
## 9  0.0036998     11   0.79915 0.84038 0.015438
## 10 0.0031712     12   0.79545 0.83457 0.015431
## 11 0.0030000     14   0.78911 0.83140 0.015426

The following decision tree has more terminal nodes than the default model. But even with fifteen terminal nodes, the model still does not predict any wine quality ratings other than 5, 6, or 7. This decision tree also shares its three initial splits with the default model, and even though the rest of the tree has a very different shape, the accuracy does not greatly improve. As long as the model only chooses among three of the wine quality ratings, it will never be able to correctly predict the ratings of all the wine samples.

fancyRpartPlot(DT_model2)

preds_DT2 <- predict(DT_model2, newdata = wineTest, type = "class")
confusionMatrix(preds_DT2, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   1  24 228 124  14   1   0
##          6   4  23 208 497 201  37   0
##          7   1   1   1  38  49  14   1
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5276          
##                  95% CI : (0.5017, 0.5534)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 1.044e-09       
##                                           
##                   Kappa : 0.2262          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.5217   0.7542  0.18561  0.00000
## Specificity           1.00000  1.00000   0.8408   0.4146  0.95345  1.00000
## Pos Pred Value            NaN      NaN   0.5816   0.5124  0.46667      NaN
## Neg Pred Value        0.99591  0.96728   0.8056   0.6740  0.84214  0.96455
## Prevalence            0.00409  0.03272   0.2979   0.4492  0.17996  0.03545
## Detection Rate        0.00000  0.00000   0.1554   0.3388  0.03340  0.00000
## Detection Prevalence  0.00000  0.00000   0.2672   0.6612  0.07157  0.00000
## Balanced Accuracy     0.50000  0.50000   0.6813   0.5844  0.56953  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

This model is not even 1% more accurate than the default model, with an accuracy of 52.76%. It also exhibits all of the same issues. Looking at the confusion matrix, the model still predicts 5 and 6 in nearly equal numbers for wines that are actually rated a 5, and it still has extremely low accuracy when predicting wines that are actually rated a 7.

Next, a completely unpruned decision tree will be created to see if there is any circumstance where this type of model can have sufficient accuracy.

set.seed(12345)
#Creating completely unpruned decision tree.
DT_model3 <- rpart(WineQuality ~ ., data = wineTrain, method = "class",
                      control = rpart.control(minbucket = 1, minsplit = 1, cp = -1),
                      model = TRUE)

The following plots reveal that this decision tree features well over 800 splits. The cross-validated error has also been somewhat reduced compared to the other models, but is still about 0.77. Since the decision tree is so large, it cannot be visualized or interpreted.

rsq.rpart(DT_model3)
## 
## Classification tree:
## rpart(formula = WineQuality ~ ., data = wineTrain, method = "class", 
##     model = TRUE, control = rpart.control(minbucket = 1, minsplit = 1, 
##         cp = -1))
## 
## Variables actually used in tree construction:
##  [1] Chlorides       CitricAcid      Density         FixedAcidity   
##  [5] FreeSO2         PercentAlc      pH              ResidSugar     
##  [9] Sulfates        TotalSO2        VolatileAcidity
## 
## Root node error: 1892/3431 = 0.55144
## 
## n= 3431 
## 
##             CP nsplit  rel error  xerror     xstd
## 1   0.05100423      0 1.00000000 1.00000 0.015397
## 2   0.02431290      2 0.89799154 0.90592 0.015480
## 3   0.02061311      3 0.87367865 0.89746 0.015479
## 4   0.01902748      4 0.85306554 0.88689 0.015476
## 5   0.00792812      5 0.83403805 0.85729 0.015457
## 6   0.00581395      7 0.81818182 0.84989 0.015449
## 7   0.00449260      8 0.81236786 0.83879 0.015436
## 8   0.00422833     10 0.80338266 0.84249 0.015441
## 9   0.00369979     11 0.79915433 0.84144 0.015439
## 10  0.00317125     12 0.79545455 0.83510 0.015431
## 11  0.00285412     14 0.78911205 0.83404 0.015430
## 12  0.00264271     24 0.75000000 0.82294 0.015413
## 13  0.00251057     29 0.73678647 0.81818 0.015406
## 14  0.00237844     33 0.72674419 0.82400 0.015415
## 15  0.00211416     54 0.67441860 0.81924 0.015407
## 16  0.00193798     67 0.64693446 0.82030 0.015409
## 17  0.00190275     73 0.63530655 0.81501 0.015400
## 18  0.00184989     78 0.62579281 0.81501 0.015400
## 19  0.00179704     91 0.60147992 0.81501 0.015400
## 20  0.00176180     98 0.58668076 0.81501 0.015400
## 21  0.00158562    104 0.57610994 0.80920 0.015390
## 22  0.00145349    147 0.50739958 0.81765 0.015405
## 23  0.00140944    151 0.50158562 0.81607 0.015402
## 24  0.00132135    157 0.49312896 0.81712 0.015404
## 25  0.00118922    161 0.48784355 0.82294 0.015413
## 26  0.00105708    165 0.48308668 0.81395 0.015398
## 27  0.00098158    283 0.35623679 0.81290 0.015396
## 28  0.00095137    297 0.34196617 0.81607 0.015402
## 29  0.00092495    306 0.33298097 0.81554 0.015401
## 30  0.00088090    315 0.32188161 0.81765 0.015405
## 31  0.00084567    328 0.30972516 0.81342 0.015397
## 32  0.00079281    333 0.30549683 0.81501 0.015400
## 33  0.00070472    361 0.28276956 0.81078 0.015393
## 34  0.00066068    376 0.27219873 0.81290 0.015396
## 35  0.00052854    388 0.26427061 0.81290 0.015396
## 36  0.00042283    792 0.04598309 0.77590 0.015318
## 37  0.00039641    797 0.04386892 0.77537 0.015316
## 38  0.00035236    805 0.04069767 0.77484 0.015315
## 39  0.00026427    836 0.02959831 0.77484 0.015315
## 40  0.00017618    944 0.00105708 0.77484 0.015315
## 41 -1.00000000    947 0.00052854 0.77484 0.015315

Even though the decision tree cannot be seen, the accuracy of the model will still be evaluated.
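The cptable values above are scaled by the root node error (1892/3431 ≈ 0.551). Converting the final row back to absolute misclassification rates makes the overfitting concrete (a quick check in Python): the tree nearly memorizes the training data while the cross-validated error barely improves.

```python
# Convert the last cptable row of the unpruned tree from relative to
# absolute error rates. "rel error" is measured on the training data;
# "xerror" comes from cross-validation.
root_node_error = 1892 / 3431
rel_error, xerror = 0.00052854, 0.77484   # last row of the cptable above

print(round(rel_error * root_node_error, 4))  # 0.0003  (training error)
print(round(xerror * root_node_error, 4))     # 0.4273  (cross-validated error)
```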

preds3 <- predict(DT_model3, newdata = wineTest, type = "class")
confusionMatrix(preds3, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   2   2   0   0   0
##          4   0  12  15  12   3   0   0
##          5   2  18 262 113  16   7   0
##          6   4  14 129 424  93  11   1
##          7   0   4  25  95 139  17   0
##          8   0   0   4  13  13  17   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5821          
##                  95% CI : (0.5564, 0.6075)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3779          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.25000   0.5995   0.6434  0.52652  0.32692
## Specificity          0.997262  0.97886   0.8485   0.6881  0.88279  0.97880
## Pos Pred Value       0.000000  0.28571   0.6268   0.6272  0.49643  0.36170
## Neg Pred Value       0.995899  0.97474   0.8332   0.7029  0.89469  0.97535
## Prevalence           0.004090  0.03272   0.2979   0.4492  0.17996  0.03545
## Detection Rate       0.000000  0.00818   0.1786   0.2890  0.09475  0.01159
## Detection Prevalence 0.002727  0.02863   0.2849   0.4608  0.19087  0.03204
## Balanced Accuracy    0.498631  0.61443   0.7240   0.6658  0.70465  0.65286
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

This model has the highest accuracy of the three decision trees, although it is still only 58.21%. Besides the accuracy, the greatest advantage of this model is that it can now predict ratings other than 5, 6, and 7: it has made predictions for the wine quality ratings of 4, 5, 6, 7, and 8. The accuracy of the predictions for wine samples rated 7 has improved as well. The main flaw of this model is that it cannot be visualized, meaning that no one can see the model’s rationale for each of its predictions. One final decision tree model will be created, aiming to be slightly less pruned than the second model to see whether a tree that is still readable can predict more than quality ratings of 5, 6, or 7.

set.seed(12345)
DT_model4 <- rpart(WineQuality ~ ., data = wineTrain, method = "class", 
                   control = rpart.control(cp = 0.002), model = TRUE)

Unfortunately, lowering the complexity parameter by just 0.001, from 0.003 to 0.002, increases the number of splits to over 60. The following plots also show that this model’s accuracy is closer to that of the first two pruned models than to the completely unpruned model.

rsq.rpart(DT_model4)
## 
## Classification tree:
## rpart(formula = WineQuality ~ ., data = wineTrain, method = "class", 
##     model = TRUE, control = rpart.control(cp = 0.002))
## 
## Variables actually used in tree construction:
##  [1] Chlorides       CitricAcid      Density         FixedAcidity   
##  [5] FreeSO2         PercentAlc      pH              ResidSugar     
##  [9] Sulfates        TotalSO2        VolatileAcidity
## 
## Root node error: 1892/3431 = 0.55144
## 
## n= 3431 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0510042      0   1.00000 1.00000 0.015397
## 2  0.0243129      2   0.89799 0.90592 0.015480
## 3  0.0206131      3   0.87368 0.89746 0.015479
## 4  0.0190275      4   0.85307 0.88689 0.015476
## 5  0.0079281      5   0.83404 0.85729 0.015457
## 6  0.0058140      7   0.81818 0.84989 0.015449
## 7  0.0044926      8   0.81237 0.83879 0.015436
## 8  0.0042283     10   0.80338 0.84302 0.015441
## 9  0.0036998     11   0.79915 0.84038 0.015438
## 10 0.0031712     12   0.79545 0.83457 0.015431
## 11 0.0028541     14   0.78911 0.83034 0.015425
## 12 0.0026427     24   0.75000 0.82082 0.015410
## 13 0.0024665     38   0.71247 0.82241 0.015413
## 14 0.0023784     41   0.70507 0.82505 0.015417
## 15 0.0022903     54   0.67125 0.82030 0.015409
## 16 0.0021142     57   0.66438 0.82030 0.015409
## 17 0.0020000     63   0.65169 0.82082 0.015410

Even though this decision tree can technically be visualized, it is much too large to be interpretable. The advantage of this model is that it does have enough terminal nodes to make predictions of 4, 5, 6, 7, and 8.

fancyRpartPlot(DT_model4)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

preds_DT4 <- predict(DT_model4, newdata = wineTest, type = "class")
confusionMatrix(preds_DT4, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   3   7   4   1   0   0
##          5   1  25 229 119   8   0   0
##          6   4  19 186 425 145  27   1
##          7   1   1  15 111 110  22   0
##          8   0   0   0   0   0   3   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5249         
##                  95% CI : (0.499, 0.5507)
##     No Information Rate : 0.4492         
##     P-Value [Acc > NIR] : 3.693e-09      
##                                          
##                   Kappa : 0.2611         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.062500   0.5240   0.6449  0.41667 0.057692
## Specificity           1.00000 0.991543   0.8515   0.5272  0.87531 1.000000
## Pos Pred Value            NaN 0.200000   0.5995   0.5266  0.42308 1.000000
## Neg Pred Value        0.99591 0.969008   0.8083   0.6455  0.87241 0.966530
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.002045   0.1561   0.2897  0.07498 0.002045
## Detection Prevalence  0.00000 0.010225   0.2604   0.5501  0.17723 0.002045
## Balanced Accuracy     0.50000 0.527022   0.6877   0.5861  0.64599 0.528846
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

The accuracy of this model is only 52.49%, slightly lower than the second model's, despite it being able to predict more of the quality ratings. The confusion matrix looks similar to the others, with many mistakes being made. The model is still wrong almost 50% of the time when classifying wines that have an actual quality rating of 5 or 7.

Decision Tree Model Comparison and Selection

None of the decision tree models performed particularly well. All of the models had significant accuracy problems predicting each one of the quality ratings. The majority of the models had significantly less than 50% accuracy when classifying wines with a quality rating of 7. The models did best when predicting wines with a quality rating of 6, although they still consistently made over 100 mistakes for this rating.

When assessing the models individually, even the completely unpruned model did not reach 60% accuracy. Considering the accuracy of this model combined with the fact that it cannot be visualized, it will not be selected as the final model. Among the other three models, the decision tree with the complexity parameter of 0.003 had the second best accuracy, even though it never classified the wine samples with quality ratings of 3, 4, 8, or 9 correctly. Therefore, this model will be chosen as the final decision tree model.

Classification by Random Forest Models

Random forest models are created by combining many decision trees, each grown on a random subset of the data and variables, to produce a more accurate prediction. The advantage of random forest models is that they are very good at solving classification problems. The disadvantages are that they are prone to overfitting and are a “black box” method. This means that unlike decision trees, where one can see all the nodes and splits to understand how each classification decision is made, a user cannot directly see into the inner workings of the random forest.
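The ensemble idea can be sketched in a few lines (Python, with hypothetical stand-in trees): each tree votes for a class and the forest reports the majority.

```python
from collections import Counter

# Minimal sketch of how a random forest combines its trees: each tree
# predicts a class for the sample, and the forest returns the majority vote.
def forest_predict(trees, sample):
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three stand-in "trees" that return fixed predictions for any sample.
trees = [lambda s: 6, lambda s: 5, lambda s: 6]
print(forest_predict(trees, sample={"PercentAlc": 10.5}))  # -> 6
```

A real forest differs in that each tree is trained on a bootstrap sample of the data and considers only a random subset of variables at each split, which is what makes the individual trees diverse enough for voting to help.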

(reference: https://deepai.org/machine-learning-glossary-and-terms/random-forest)

The first random forest model will be built using 300 trees. This large number of trees will reveal how many trees are needed before the model's error stops decreasing. From there, the number of trees can be reduced to save computation time while retaining the accuracy of the larger model.

#Create the random forest model.
set.seed(12345)
rf1 <- randomForest(WineQuality ~ ., data = wineTrain, ntree = 300, importance = TRUE, proximity = TRUE)

print(rf1)
## 
## Call:
##  randomForest(formula = WineQuality ~ ., data = wineTrain, ntree = 300,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 32.85%
## Confusion matrix:
##   3  4   5    6   7  8 9 class.error
## 3 0  0   6    8   0  0 0   1.0000000
## 4 0 27  53   31   4  0 0   0.7652174
## 5 0  3 663  346   8  0 0   0.3500000
## 6 0  0 209 1228 101  1 0   0.2020793
## 7 0  0   8  268 335  5 0   0.4561688
## 8 0  0   1   36  35 51 0   0.5853659
## 9 0  0   0    1   3  0 0   1.0000000

In the above confusion matrix, the class error for each wine quality rating is shown: how often the model's prediction for wines of that rating is wrong. This output is related to the plot below, where each of the different colored lines traces the model's error rate for one wine quality rating; the lines can be matched to the ratings they represent using the confusion matrix.
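Each class.error value is simply the off-diagonal fraction of its row in the OOB confusion matrix. For example, the quality-5 row above can be checked by hand (Python sketch):

```python
# Per-class error as reported by randomForest: for each true class (row of
# the OOB confusion matrix), the fraction of samples not on the diagonal.
row_class5 = [0, 3, 663, 346, 8, 0, 0]   # true quality 5, from the output above
correct = row_class5[2]                   # column for predicted quality 5
class_error = 1 - correct / sum(row_class5)
print(round(class_error, 2))  # 0.35
```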

plot(rf1)

The above plot reveals that the overall error of the model seems to minimize between 50 and 100 trees. There is very little improvement for any of the class error rates past the 100 tree mark.

The following plot shows the number of nodes in which each variable is featured, as well as its mean minimal depth. PercentAlc appears in the fewest nodes, but when it does appear it is almost always at the top of the tree. Density, VolatileAcidity, and Chlorides are also frequently featured at the higher levels of the trees. All of the variables besides PercentAlc appear in a large number of nodes.

plot_multi_way_importance(measure_importance(rf1), size_measure = "no_of_nodes")

The following plot visualizes the same data in a different way. PercentAlc, Density, VolatileAcidity, and Chlorides are featured in the initial node most often. The rest of the variables have much greater mean minimal depth. All of the variables on the plot are found in all of the trees. This means that to some extent, the random forest is using the values of all the variables to make its decisions.

plot_min_depth_distribution(min_depth_distribution(rf1))

#Evaluate the accuracy of the model using the test data.
predsRF1 <- predict(rf1, newdata = wineTest, type = "class")
confusionMatrix(predsRF1, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   5   1   1   0   0   0
##          5   2  26 303  78   8   0   0
##          6   4  17 131 533 111  21   0
##          7   0   0   2  46 145  17   1
##          8   0   0   0   1   0  14   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6817          
##                  95% CI : (0.6571, 0.7055)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5016          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.104167   0.6934   0.8088  0.54924 0.269231
## Specificity           1.00000 0.998591   0.8893   0.6485  0.94514 0.999293
## Pos Pred Value            NaN 0.714286   0.7266   0.6524  0.68720 0.933333
## Neg Pred Value        0.99591 0.970548   0.8724   0.8062  0.90525 0.973829
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.003408   0.2065   0.3633  0.09884 0.009543
## Detection Prevalence  0.00000 0.004772   0.2843   0.5569  0.14383 0.010225
## Balanced Accuracy     0.50000 0.551379   0.7913   0.7287  0.74719 0.634262
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

The random forest model has much higher accuracy than the decision tree models: 68.17% compared to the NIR of 44.92%. Similar to the decision tree models, this one never correctly classifies a wine that has a quality rating of 3 or 9. It does, however, correctly classify the wines with ratings of 5, 6, and 7 much more often than the decision tree models did. This model predicts a wine quality of 5 for nearly 70% of the wines that were actually rated a 5, compared to the close to 50/50 split of the decision tree models.

The second random forest model will be limited to 100 trees. This decision is based on the plot from the previous model. The error rate approached a minimum for all the classes before the 100 tree mark. This model will be significantly more efficient to run while hopefully having very similar accuracy to the 300 tree model.

#Create the random forest model.
set.seed(12345)
rf2 <- randomForest(WineQuality ~ ., data = wineTrain, ntree = 100, importance = TRUE, proximity = TRUE)

print(rf2)
## 
## Call:
##  randomForest(formula = WineQuality ~ ., data = wineTrain, ntree = 100,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 33.58%
## Confusion matrix:
##   3  4   5    6   7  8 9 class.error
## 3 0  0   5    9   0  0 0   1.0000000
## 4 1 27  52   31   4  0 0   0.7652174
## 5 0  4 657  347  12  0 0   0.3558824
## 6 0  1 224 1210 104  0 0   0.2137752
## 7 0  0   8  264 334 10 0   0.4577922
## 8 0  0   1   33  38 51 0   0.5853659
## 9 0  0   0    2   2  0 0   1.0000000

The following plot reveals that the class error rates are indeed leveling off at a minimum by the 100 tree mark.

plot(rf2)

The following plot is almost identical to the previous model’s. PercentAlc is featured in a low number of nodes, but when it is, it is always at the top of the tree. Density, VolatileAcidity, and Chlorides are similarly grouped towards the upper left corner of the plot. pH and Sulfates remain in the bottom right corner of the plot with the greatest mean minimal depth.

plot_multi_way_importance(measure_importance(rf2), size_measure = "no_of_nodes")

Again, PercentAlc, Density, VolatileAcidity, and Chlorides are featured in the initial split most often. PercentAlc has the lowest mean minimal depth by far, followed by Density and VolatileAcidity which both have values under 2. The rest of the variables all have a mean minimal depth greater than 2.

plot_min_depth_distribution(min_depth_distribution(rf2))

#Evaluate the accuracy of the model using the test data.
predsRF2 <- predict(rf2, newdata = wineTest, type = "class")
confusionMatrix(predsRF2, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   5   2   1   0   0   0
##          5   2  27 295  80   8   0   0
##          6   4  16 137 527 110  19   0
##          7   0   0   3  50 145  19   1
##          8   0   0   0   1   1  14   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6721          
##                  95% CI : (0.6474, 0.6961)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4878          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.104167   0.6751   0.7997  0.54924 0.269231
## Specificity           1.00000 0.997886   0.8864   0.6460  0.93932 0.998587
## Pos Pred Value            NaN 0.625000   0.7160   0.6482  0.66514 0.875000
## Neg Pred Value        0.99591 0.970528   0.8654   0.7982  0.90472 0.973811
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.003408   0.2011   0.3592  0.09884 0.009543
## Detection Prevalence  0.00000 0.005453   0.2808   0.5542  0.14860 0.010907
## Balanced Accuracy     0.50000 0.551026   0.7807   0.7229  0.74428 0.633909
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

As expected, this model has a greatly reduced number of trees, but almost the same accuracy as the 300 tree model. The accuracy of this model is 67.21% compared to the NIR of 44.92%. The confusion matrix reveals that this model retains the same advantages over the decision tree models. It is correctly classifying wines with a quality rating of 7 more often than not, and the accuracy when classifying wines with a quality rating of 5 is well above 50%.

The next model will test whether changing the number of variables considered at each split affects accuracy. Setting the mtry parameter to 5 means that at every split of every decision tree, 5 randomly selected variables are eligible as split candidates, rather than the default of 3 (roughly the square root of the 11 predictors) used by the previous models. This lets each split consider about half of the available variables.
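What mtry controls can be sketched as follows (Python, using the predictor names from this dataset): at every split, only a random subset of mtry predictors is eligible to be tested.

```python
import random

# At each split, a random forest samples mtry of the available predictors
# and considers only those as split candidates.
predictors = ["FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidSugar",
              "Chlorides", "FreeSO2", "TotalSO2", "Density", "pH",
              "Sulfates", "PercentAlc"]

def split_candidates(mtry, rng=random):
    return rng.sample(predictors, mtry)

random.seed(12345)
cands = split_candidates(5)
print(len(cands))                           # 5
print(all(c in predictors for c in cands))  # True
```

A fresh subset is drawn at every split of every tree, so over the whole forest each variable still gets many chances to be used.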

#Create the random forest model.
set.seed(12345)
rf3 <- randomForest(WineQuality ~ ., data = wineTrain, ntree = 100, mtry = 5, importance = TRUE, proximity = TRUE)

print(rf3)
## 
## Call:
##  randomForest(formula = WineQuality ~ ., data = wineTrain, ntree = 100,      mtry = 5, importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 33.4%
## Confusion matrix:
##   3  4   5    6   7  8 9 class.error
## 3 0  0   7    7   0  0 0   1.0000000
## 4 0 29  55   28   3  0 0   0.7478261
## 5 0  2 671  332  15  0 0   0.3421569
## 6 0  1 243 1190 102  3 0   0.2267706
## 7 0  0  11  260 342  3 0   0.4448052
## 8 0  0   2   32  36 53 0   0.5691057
## 9 0  0   0    2   2  0 0   1.0000000

This model’s error rates show a similar pattern compared to the previous two models. It appears that the error rates are again minimizing by the time the model reaches 100 trees, so the accuracy of this model should not be significantly reduced compared to the previous models.

plot(rf3)

The PercentAlc variable has a lower mean minimal depth than the previous models. Density is still located toward the upper area of the plot, with a mean minimal depth of 2. The rest of the variables are now clustering towards the bottom right corner of the plot, where previously they were more distributed towards the middle.

plot_multi_way_importance(measure_importance(rf3), size_measure = "no_of_nodes")

PercentAlc and Density are now featured in the initial split much more often than any of the other variables. PercentAlc now has a mean minimal depth of 0.73 while the rest of the variables have a much greater mean minimal depth. Even when reducing the number of randomly selected variables that can be tried at each split, all of the variables on the plot are still featured in all 100 trees.

plot_min_depth_distribution(min_depth_distribution(rf3))

#Evaluate the accuracy of the model using the test data.
predsRF3 <- predict(rf3, newdata = wineTest, type = "class")
confusionMatrix(predsRF3, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   7   1   1   0   0   0
##          5   3  26 287  87   8   0   0
##          6   3  15 145 520 116  17   1
##          7   0   0   4  50 139  21   0
##          8   0   0   0   1   1  14   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6592          
##                  95% CI : (0.6343, 0.6834)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4671          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.145833   0.6568   0.7891  0.52652 0.269231
## Specificity           1.00000 0.998591   0.8796   0.6324  0.93766 0.998587
## Pos Pred Value            NaN 0.777778   0.6983   0.6365  0.64953 0.875000
## Neg Pred Value        0.99591 0.971879   0.8580   0.7862  0.90024 0.973811
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.004772   0.1956   0.3545  0.09475 0.009543
## Detection Prevalence  0.00000 0.006135   0.2802   0.5569  0.14588 0.010907
## Balanced Accuracy     0.50000 0.572212   0.7682   0.7108  0.73209 0.633909
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

The confusion matrix reveals that the accuracy of this model is slightly reduced compared to the others. The accuracy of the model is 65.92% compared to the NIR of 44.92%. This model is still predicting wines with a rating of 5 correctly much more often than it gets them wrong, but the accuracy when predicting wines with a rating of 7 has fallen closer to 50/50.

Reducing the number of trees and the number of variables available at each split has reduced the accuracy of the model. Even though the error rates appear to have reached their minimum before 300 trees, an unrestricted model will be created to see the best-case scenario for the accuracy rate. The ntree parameter is removed so the random forest will grow its default of 500 trees.

#Create the random forest model.
set.seed(12345)
rf4 <- randomForest(WineQuality ~ ., data = wineTrain, importance = TRUE, proximity = TRUE)

print(rf4)
## 
## Call:
##  randomForest(formula = WineQuality ~ ., data = wineTrain, importance = TRUE,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 32.47%
## Confusion matrix:
##   3  4   5    6   7  8 9 class.error
## 3 0  0   6    8   0  0 0   1.0000000
## 4 0 26  52   33   4  0 0   0.7739130
## 5 0  3 667  341   9  0 0   0.3460784
## 6 0  0 210 1233  96  0 0   0.1988304
## 7 0  0   7  265 340  4 0   0.4480519
## 8 0  0   1   36  35 51 0   0.5853659
## 9 0  0   0    1   3  0 0   1.0000000

The error rate plot reveals that the 500-tree model does not achieve lower error rates than the 300-tree model, or even the 100-tree model.

plot(rf4)

PercentAlc is still located in the upper left portion of the plot, with Density, Chlorides, and VolatileAcidity again clustered slightly below and to the right. With the greater number of trees, pH is now grouped closer to the secondary cluster of variables that was present in the plots of the other models. Sulfates remains the variable with the greatest mean minimal depth.

plot_multi_way_importance(measure_importance(rf4), size_measure = "no_of_nodes")

PercentAlc, VolatileAcidity, Density, and Chlorides are still featured in the initial split of the trees most often. Now, pH’s mean minimal depth has dropped below 3. PercentAlc, VolatileAcidity, and Density are still the only variables with a mean minimal depth less than 2. All of the variables in the plot are still used in all 500 of the trees.

plot_min_depth_distribution(min_depth_distribution(rf4))

#Evaluate the accuracy of the model using the test data.
predsRF4 <- predict(rf4, newdata = wineTest, type = "class")
confusionMatrix(predsRF4, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   5   2   1   0   0   0
##          5   2  27 299  75   7   0   0
##          6   4  16 133 534 114  20   0
##          7   0   0   3  48 143  18   1
##          8   0   0   0   1   0  14   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6783          
##                  95% CI : (0.6537, 0.7021)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4962          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.104167   0.6842   0.8103  0.54167 0.269231
## Specificity           1.00000 0.997886   0.8922   0.6448  0.94181 0.999293
## Pos Pred Value            NaN 0.625000   0.7293   0.6504  0.67136 0.933333
## Neg Pred Value        0.99591 0.970528   0.8694   0.8065  0.90351 0.973829
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.003408   0.2038   0.3640  0.09748 0.009543
## Detection Prevalence  0.00000 0.005453   0.2795   0.5596  0.14519 0.010225
## Balanced Accuracy     0.50000 0.551026   0.7882   0.7276  0.74174 0.634262
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Although this model grew 200 more trees, it has lower accuracy than the 300-tree model. This model’s accuracy is 67.83% compared to the NIR of 44.92%. Again, the model correctly predicts wines with a quality rating of 7 only slightly more often than it gets them wrong. The accuracy when classifying wines rated a 5 is still improved compared to the decision tree models. Even with 500 trees, the random forest model still never predicts that a wine sample could be rated a 3 or a 9.

Random Forest Model Comparison and Selection

The random forest models all performed similarly, with accuracies in the mid-to-upper 60% range. Surprisingly, the 300-tree model performed better than the 500-tree model. The 100-tree model with the variable limitation performed the worst, with approximately 3% less accuracy than the 300-tree model. None of the models predicted that a wine could have a quality rating of 3 or 9. However, these models all predicted the ratings of the wines that are actually rated a 5, 6, or 7 with much greater accuracy than the decision tree models. Combining many decision trees into a single model, rather than relying on a single decision tree, appears to be a much more robust solution for this classification problem.

The 300-tree model will be chosen as the final random forest model. The random forest models performed better than the decision trees, but their accuracy is still fairly mediocre. Therefore, it is important to preserve the highest-accuracy model to establish the best accuracy that can be achieved on this classification problem.
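The comparison above was read off the printed confusion matrices, but the overall accuracies can also be collected programmatically from caret’s confusionMatrix objects. A minimal, self-contained sketch using the iris data (the wine models would be compared the same way against their own test set):

```r
library(randomForest)
library(caret)

# Hold out a test set from iris, mirroring the wine train/test split.
set.seed(12345)
idx   <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit a forest with the given tree count and pull the overall accuracy
# out of caret's confusionMatrix summary.
accuracy_for <- function(ntree) {
  rf <- randomForest(Species ~ ., data = train, ntree = ntree)
  cm <- confusionMatrix(predict(rf, test), test$Species)
  unname(cm$overall["Accuracy"])
}

sapply(c(100, 300, 500), accuracy_for)
```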

Classification by Support Vector Machines (SVM)

SVM models are intended for binary classification. The input data is used to create a geometric boundary, typically a hyperplane (a line in two dimensions), between the two output classes. The model’s goal is to maximize the margin between the classes and the boundary. When the classes are not linearly separable, the data can be implicitly mapped into a higher-dimensional space using a kernel function, which allows the boundary to be non-linear in the original feature space.
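The margin-maximizing boundary is easiest to see in the binary case. A minimal sketch using two of the iris species and two features, chosen only so the fitted boundary and support vectors can actually be drawn:

```r
library(e1071)

# Two classes, two features: a setting where the decision boundary
# and margin can be visualized directly.
d <- droplevels(subset(iris, Species != "virginica",
                       select = c(Petal.Length, Petal.Width, Species)))

fit <- svm(Species ~ ., data = d, kernel = "linear", cost = 1)

plot(fit, d)   # shades the two class regions and marks the support vectors
sum(fit$nSV)   # the support vectors are the points that define the margin
```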

Although SVM is a binary classifier, it can be extended to multiclass prediction, as is the case here, where one of seven wine quality rating categories needs to be predicted. This can be done using either a pairwise (one-vs-one) approach or a one-vs-all approach, where a wine quality rating would be classified as either 3 or not 3, 4 or not 4, 5 or not 5, and so on, and then the most confident prediction is selected.
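A rough sketch of the one-vs-all idea, illustrated on the built-in iris data rather than the wine data (note that e1071’s svm() itself uses the one-vs-one approach internally, so this is a hand-rolled alternative, not what the models below do):

```r
library(e1071)

classes <- levels(iris$Species)

# Fit one binary SVM per class: "this class" vs "everything else".
models <- lapply(classes, function(cl) {
  y <- factor(ifelse(iris$Species == cl, cl, "other"))
  svm(x = iris[, 1:4], y = y, kernel = "radial", probability = TRUE)
})

# For new samples, take the class whose binary model is most confident.
predict_ovr <- function(newx) {
  conf <- sapply(seq_along(classes), function(i) {
    p <- predict(models[[i]], newx, probability = TRUE)
    attr(p, "probabilities")[, classes[i]]
  })
  classes[max.col(matrix(conf, ncol = length(classes)))]
}

predict_ovr(iris[c(1, 51, 101), 1:4])
```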

These models have a high tolerance for noisy data, can accept many types of data as input, and handle high-dimensional feature spaces well. These advantages are balanced by the facts that each kernel type requires its own parameters to be set, that training can be slow on very large data sets, and that models generated by non-linear kernels are extremely difficult to interpret.

In this analysis, nine total SVM models will be created: three different kernels, each with three different cost parameters. The three kernels are linear, polynomial, and radial. The linear kernel is the simplest; it often performs best when there are many features and is faster to train than the other kernels. Polynomial kernels tend to be less efficient than the linear kernel, as the polynomial mapping is a more general representation. The radial kernel is one of the most popular; it is often used for non-linear data and can perform well even when there is no prior knowledge of the data.

(reference: https://dataaspirant.com/svm-kernels/#t-1608054630726)

Each kernel will be tuned using cost parameters of 1, 10, and 50, where 1 is the model’s default value. The cost parameter is a regularization parameter that controls how many training errors are tolerated when the model is built. When the cost value is large, the model is built with fewer training errors but has a greater chance of overfitting. When the cost value is low, the model allows more training errors but has a wider margin and is more generalizable. The various cost parameters will be compared to see how they impact the accuracy of the models.
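This kind of cost search can also be automated with e1071’s tune() helper, which cross-validates each candidate value; a sketch on the built-in iris data (the models below instead fit each cost value by hand):

```r
library(e1071)

# Grid search over the same three cost values used in this analysis,
# scored by 10-fold cross-validation (tune()'s default).
set.seed(12345)
tuned <- tune(svm, Species ~ ., data = iris,
              kernel = "radial",
              ranges = list(cost = c(1, 10, 50)))

tuned$performances         # cross-validated error for each candidate cost
tuned$best.parameters$cost # the cost value with the lowest CV error
```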

The SVM models will be presented without visualizations, because the plot function can only visualize an SVM model that predicts a binary output class.

The first group of three models will use the linear kernel. The cost parameter that produces the best accuracy will be compared against the best accuracy model of the other kernel types. Since SVM models require normalized data to produce the best results, the normalized classification data set will be used for all of the models.

The first linear kernel SVM model has a cost parameter of 1.

set.seed(12345)
#Creating linear kernel SVM, cost = 1.
SVM_L1 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "linear", 
             cost = 1)

#Evaluate the accuracy of the model.
preds_SVM_L1 <- predict(SVM_L1, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_L1, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   3  28 233 106  10   2   0
##          6   3  20 204 553 254  50   1
##          7   0   0   0   0   0   0   0
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5358          
##                  95% CI : (0.5099, 0.5616)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 1.821e-11       
##                                           
##                   Kappa : 0.2135          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.5332   0.8392     0.00  0.00000
## Specificity           1.00000  1.00000   0.8553   0.3416     1.00  1.00000
## Pos Pred Value            NaN      NaN   0.6099   0.5097      NaN      NaN
## Neg Pred Value        0.99591  0.96728   0.8120   0.7225     0.82  0.96455
## Prevalence            0.00409  0.03272   0.2979   0.4492     0.18  0.03545
## Detection Rate        0.00000  0.00000   0.1588   0.3770     0.00  0.00000
## Detection Prevalence  0.00000  0.00000   0.2604   0.7396     0.00  0.00000
## Balanced Accuracy     0.50000  0.50000   0.6943   0.5904     0.50  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

The first linear kernel model has severe difficulty classifying the wines by quality rating. The model only predicts that a wine sample is rated a 5 or a 6, and even for the wine samples that are actually rated a 5, it classifies them correctly only about half of the time. This model has an overall accuracy of 53.58%, but since it predicts only 2 of the 7 wine rating categories, it is unusable for this classification task.

The second linear kernel SVM model has a cost parameter of 10.

set.seed(12345)
#Creating linear kernel SVM, cost = 10.
SVM_L10 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "linear", 
             cost = 10)

#Evaluate the accuracy of the model.
preds_SVM_L10 <- predict(SVM_L10, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_L10, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   2  28 233 105  10   2   0
##          6   4  20 204 554 254  50   1
##          7   0   0   0   0   0   0   0
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5365          
##                  95% CI : (0.5106, 0.5622)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 1.277e-11       
##                                           
##                   Kappa : 0.2143          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.5332   0.8407     0.00  0.00000
## Specificity           1.00000  1.00000   0.8573   0.3403     1.00  1.00000
## Pos Pred Value            NaN      NaN   0.6132   0.5097      NaN      NaN
## Neg Pred Value        0.99591  0.96728   0.8123   0.7237     0.82  0.96455
## Prevalence            0.00409  0.03272   0.2979   0.4492     0.18  0.03545
## Detection Rate        0.00000  0.00000   0.1588   0.3776     0.00  0.00000
## Detection Prevalence  0.00000  0.00000   0.2590   0.7410     0.00  0.00000
## Balanced Accuracy     0.50000  0.50000   0.6952   0.5905     0.50  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

When the cost parameter is increased to 10, the model has almost identical accuracy. Looking at the confusion matrix, only two predictions changed: the model guessed that one wine sample with a rating of 3 was a 6 instead of a 5, and it classified one additional wine sample with a rating of 6 correctly. Increasing the cost parameter to 10 did not improve the model.

The third linear kernel SVM model has a cost parameter of 50.

set.seed(12345)
#Creating linear kernel SVM, cost = 50.
SVM_L50 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "linear", 
             cost = 50)

#Evaluate the accuracy of the model.
preds_SVM_L50 <- predict(SVM_L50, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_L50, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   3  28 232 105  10   2   0
##          6   3  20 205 554 254  50   1
##          7   0   0   0   0   0   0   0
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5358          
##                  95% CI : (0.5099, 0.5616)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 1.821e-11       
##                                           
##                   Kappa : 0.2132          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.5309   0.8407     0.00  0.00000
## Specificity           1.00000  1.00000   0.8563   0.3403     1.00  1.00000
## Pos Pred Value            NaN      NaN   0.6105   0.5097      NaN      NaN
## Neg Pred Value        0.99591  0.96728   0.8114   0.7237     0.82  0.96455
## Prevalence            0.00409  0.03272   0.2979   0.4492     0.18  0.03545
## Detection Rate        0.00000  0.00000   0.1581   0.3776     0.00  0.00000
## Detection Prevalence  0.00000  0.00000   0.2590   0.7410     0.00  0.00000
## Balanced Accuracy     0.50000  0.50000   0.6936   0.5905     0.50  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

With a cost parameter of 50, the model has the same accuracy as when the cost parameter was 1. The confusion matrices are almost identical as well. These three models confirm that the linear kernel does a poor job of classifying the data and that increasing the cost parameter up to 50 does not improve the accuracy of the model.

The first polynomial kernel SVM has a cost parameter of 1.

set.seed(12345)
#Creating polynomial kernel SVM, cost = 1.
SVM_P1 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "polynomial", 
             cost = 1)

#Evaluate the accuracy of the model.
preds_SVM_P1 <- predict(SVM_P1, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_P1, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   1   6   3   0   0   0   0
##          5   1  20 155  59   7   0   0
##          6   3  22 278 580 214  39   1
##          7   0   0   0  20  41  13   0
##          8   1   0   1   0   2   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5331          
##                  95% CI : (0.5071, 0.5588)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 7.328e-11       
##                                           
##                   Kappa : 0.213           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.125000   0.3547   0.8801  0.15530 0.000000
## Specificity           1.00000 0.997181   0.9155   0.3106  0.97257 0.997173
## Pos Pred Value            NaN 0.600000   0.6405   0.5101  0.55405 0.000000
## Neg Pred Value        0.99591 0.971174   0.7698   0.7606  0.83991 0.964457
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.004090   0.1057   0.3954  0.02795 0.000000
## Detection Prevalence  0.00000 0.006817   0.1650   0.7751  0.05044 0.002727
## Balanced Accuracy     0.50000 0.561091   0.6351   0.5954  0.56394 0.498587
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

The polynomial kernel model is able to predict that a wine sample will have a 4, 5, 6, 7, or 8 rating. The accuracy of the model, however, is still at 53% so there is no improvement to the overall accuracy when compared to the linear kernel models. The polynomial kernel model classifies wines with a rating of 5 incorrectly more often than it does correctly. It also classifies the great majority of wines with a rating of 7 incorrectly. This model is only fairly accurate at classifying wines rated a 6.

The second polynomial kernel SVM model has a cost parameter of 10.

set.seed(12345)
#Creating polynomial kernel SVM, cost = 10.
SVM_P10 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "polynomial", 
             cost = 10)

#Evaluate the accuracy of the model.
preds_SVM_P10 <- predict(SVM_P10, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_P10, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   2   1   0   0   0
##          4   1   8   4   2   0   0   0
##          5   1  20 187  85  11   1   0
##          6   4  20 236 535 188  34   1
##          7   0   0   7  36  63  14   0
##          8   0   0   0   0   2   3   0
##          9   0   0   1   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5426          
##                  95% CI : (0.5167, 0.5683)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 4.645e-13       
##                                           
##                   Kappa : 0.2516          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.166667   0.4279   0.8118  0.23864 0.057692
## Specificity          0.997947 0.995067   0.8854   0.4022  0.95262 0.998587
## Pos Pred Value       0.000000 0.533333   0.6131   0.5255  0.52500 0.600000
## Neg Pred Value       0.995902 0.972452   0.7849   0.7238  0.85078 0.966484
## Prevalence           0.004090 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate       0.000000 0.005453   0.1275   0.3647  0.04294 0.002045
## Detection Prevalence 0.002045 0.010225   0.2079   0.6939  0.08180 0.003408
## Balanced Accuracy    0.498973 0.580867   0.6567   0.6070  0.59563 0.528139
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          0.9993179
## Pos Pred Value       0.0000000
## Neg Pred Value       0.9993179
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0006817
## Balanced Accuracy    0.4996589

Changing the cost parameter of the polynomial kernel model to 10 has raised the accuracy by about one percentage point. The accuracy of the model is now 54.26% compared to the NIR of 44.92%. While this model is marginally more accurate, it still displays the same characteristics as the polynomial kernel model with a cost parameter of 1: it misclassifies the majority of wines rated a 7 and is very inaccurate when classifying wines rated a 5.

The third polynomial kernel SVM model has a cost parameter of 50.

set.seed(12345)
#Creating polynomial kernel SVM, cost = 50.
SVM_P50 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "polynomial", 
             cost = 50)

#Evaluate the accuracy of the model.
preds_SVM_P50 <- predict(SVM_P50, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_P50, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   1   4   0   0   0
##          4   1  10   8   4   1   0   0
##          5   1  19 187  86   8   1   0
##          6   4  18 227 501 169  27   1
##          7   0   1  12  58  81  15   0
##          8   0   0   0   4   5   9   0
##          9   0   0   2   2   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5372          
##                  95% CI : (0.5112, 0.5629)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 8.933e-12       
##                                           
##                   Kappa : 0.2621          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.208333   0.4279   0.7602  0.30682 0.173077
## Specificity          0.996578 0.990134   0.8883   0.4480  0.92851 0.993640
## Pos Pred Value       0.000000 0.416667   0.6192   0.5290  0.48503 0.500000
## Neg Pred Value       0.995896 0.973666   0.7854   0.6962  0.85923 0.970324
## Prevalence           0.004090 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate       0.000000 0.006817   0.1275   0.3415  0.05521 0.006135
## Detection Prevalence 0.003408 0.016360   0.2059   0.6455  0.11384 0.012270
## Balanced Accuracy    0.498289 0.599234   0.6581   0.6041  0.61767 0.583358
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          0.9972715
## Pos Pred Value       0.0000000
## Neg Pred Value       0.9993165
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0027267
## Balanced Accuracy    0.4986357

Setting the cost parameter to 50 has lowered the accuracy back toward that of the model with a cost parameter of 1. Since this model is only about half a percentage point more accurate than the cost-1 model, it is unsurprising that the confusion matrices are extremely similar. Once again, the model shows the same weaknesses when predicting wines rated a 5 or a 7.

The first radial kernel SVM model has a cost parameter of 1.

set.seed(12345)
#Creating radial kernel SVM, cost = 1.
SVM_R1 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "radial", 
             cost = 1)

#Evaluate the accuracy of the model.
preds_SVM_R1 <- predict(SVM_R1, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_R1, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   1  29 252 111  11   1   0
##          6   5  19 182 511 169  30   0
##          7   0   0   3  37  84  21   1
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5774          
##                  95% CI : (0.5516, 0.6028)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3176          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.5767   0.7754  0.31818  0.00000
## Specificity           1.00000  1.00000   0.8515   0.4988  0.94846  1.00000
## Pos Pred Value            NaN      NaN   0.6222   0.5579  0.57534      NaN
## Neg Pred Value        0.99591  0.96728   0.8258   0.7314  0.86374  0.96455
## Prevalence            0.00409  0.03272   0.2979   0.4492  0.17996  0.03545
## Detection Rate        0.00000  0.00000   0.1718   0.3483  0.05726  0.00000
## Detection Prevalence  0.00000  0.00000   0.2761   0.6244  0.09952  0.00000
## Balanced Accuracy     0.50000  0.50000   0.7141   0.6371  0.63332  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Even though the radial kernel model with a cost parameter of 1 only classifies the wines as a 5, 6, or 7, it immediately has marginally better overall accuracy than the linear and polynomial kernel models. The accuracy of this model is 57.74% compared to the NIR of 44.92%. Although the model will never classify a wine rated 3, 4, 8, or 9 correctly, it is more accurate on the wines rated 5, 6, or 7: it classifies the wines rated a 5 correctly more often than not, and its accuracy on wines rated a 7 is slightly improved.

The second radial kernel SVM model has a cost parameter of 10.

set.seed(12345)
#Creating radial kernel SVM, cost = 10.
SVM_R10 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "radial", 
             cost = 10)

#Evaluate the accuracy of the model.
preds_SVM_R10 <- predict(SVM_R10, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_R10, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   6   4   1   0   0   0
##          5   4  28 251 108   7   1   0
##          6   2  14 172 478 139  14   1
##          7   0   0  10  69 117  34   0
##          8   0   0   0   3   1   3   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5828          
##                  95% CI : (0.5571, 0.6082)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3474          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.125000   0.5744   0.7253  0.44318 0.057692
## Specificity           1.00000 0.996476   0.8563   0.5767  0.90607 0.997173
## Pos Pred Value            NaN 0.545455   0.6291   0.5829  0.50870 0.428571
## Neg Pred Value        0.99591 0.971154   0.8258   0.7202  0.88116 0.966438
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.004090   0.1711   0.3258  0.07975 0.002045
## Detection Prevalence  0.00000 0.007498   0.2720   0.5590  0.15678 0.004772
## Balanced Accuracy     0.50000 0.560738   0.7153   0.6510  0.67462 0.527433
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Setting the cost parameter to 10 raises the accuracy of the model by about 0.5%. It also improves the model such that it now predicts wine quality ratings of 4 and 8, bringing it up to the same level as the polynomial kernel models. The radial kernel continues to be more accurate than the polynomial kernel models. The confusion matrix shows that the model’s ability to classify wines rated as a 5 or 7 has improved as well.

The third radial kernel SVM model has a cost parameter of 50.

set.seed(12345)
#Creating radial kernel SVM, cost = 50.
SVM_R50 <- svm(WineQuality ~ ., data = wineDistTrain, kernel = "radial", 
             cost = 50)

#Evaluate the accuracy of the model.
preds_SVM_R50 <- predict(SVM_R50, newdata = wineDistTest, type = "class")
confusionMatrix(preds_SVM_R50, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   1   0
##          4   0  13  14   7   0   0   0
##          5   5  23 271 112   8   2   0
##          6   1  11 139 443 112  14   0
##          7   0   1  12  88 135  23   1
##          8   0   0   1   7   9  12   0
##          9   0   0   0   2   0   0   0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5958         
##                  95% CI : (0.5701, 0.621)
##     No Information Rate : 0.4492         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3882         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                       Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.0000000 0.270833   0.6201   0.6722  0.51136  0.23077
## Specificity          0.9993155 0.985201   0.8544   0.6572  0.89609  0.98799
## Pos Pred Value       0.0000000 0.382353   0.6437   0.6153  0.51923  0.41379
## Neg Pred Value       0.9959072 0.975576   0.8413   0.7108  0.89312  0.97218
## Prevalence           0.0040900 0.032720   0.2979   0.4492  0.17996  0.03545
## Detection Rate       0.0000000 0.008862   0.1847   0.3020  0.09202  0.00818
## Detection Prevalence 0.0006817 0.023177   0.2870   0.4908  0.17723  0.01977
## Balanced Accuracy    0.4996578 0.628017   0.7373   0.6647  0.70373  0.60938
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          0.9986357
## Pos Pred Value       0.0000000
## Neg Pred Value       0.9993174
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0013633
## Balanced Accuracy    0.4993179

Raising the cost parameter to 50 continues to improve the accuracy of the model to 59.58%. This is the most accurate of all the SVM models. This model has the best accuracy when classifying wines that have a rating of 5 or 7. In both instances it gets more of them right than it does wrong. It is also able to correctly classify some of the wines rated as a 4 or an 8. Although this model represents an improvement, the accuracy is still fairly poor and it does not successfully classify the wine samples based on their quality ratings.

SVM Model Comparison and Selection

Although none of the SVM models performed particularly well, they did perform better than most of the other models. The linear kernel models had the lowest accuracy, and the polynomial kernel models were only slightly more accurate. The radial kernel models performed the best, although their accuracy levels are still sorely lacking. When using the radial kernel, the accuracy improved as the cost parameter was increased. Increasing the cost parameter allowed the model to correctly classify some of the wine samples that were rated as a 4 or 8, and improved the accuracy when classifying wine samples that were rated as a 5 or 7.

For these reasons, the radial kernel SVM with cost parameter equal to 50 will be chosen as the final SVM model. None of the SVM models were particularly good at classifying this data, and none would actually be applied as a classification model for the data set. It appears as though the machine learning algorithms are not able to differentiate the wine quality ratings based on the physiochemical properties provided in the data set.

Classification by k-Nearest Neighbor (kNN) Models

k-Nearest Neighbor models use distance measures to make their predictions, similar to how Hierarchical Agglomerative Clustering algorithms function. kNN is an example of instance-based learning, also known as “lazy learning”, because the model stores the training examples without doing any calculations during the training process. The classification and prediction steps are delayed until the model is given new examples from the testing data. The model then compares each test example, according to its distance from the data examples, with all of the training examples that were provided. The model predicts that the test example belongs to the majority category among its k nearest training examples. This type of model makes no assumptions about the data, works well with complex decision functions, and the decision boundary it creates has no predefined shape (unlike SVMs). However, the models are sensitive to noisy data and have a high computational cost because all of the computation takes place during the prediction step.

The kNN algorithm available in R uses the Euclidean distance measure calculation. Euclidean distance is the “straight-line” physical distance between the data entries. This means that the model requires the use of the normalized training and testing data.
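As a rough illustration of the idea, the sketch below computes the Euclidean distance from a new example to two stored training examples and returns the label of the single closest neighbor (a 1-NN prediction). The two-dimensional points and labels are made up for illustration; they are not taken from the wine data.

```r
# Toy sketch of Euclidean distance and a 1-NN prediction (made-up data).
euclidean <- function(a, b) sqrt(sum((a - b)^2))

train_points <- rbind(c(0.2, 0.8), c(0.9, 0.1))  # two stored training examples
train_labels <- c("5", "6")                      # their quality labels
new_point    <- c(0.25, 0.70)                    # unseen test example

# Distance from the test example to every stored training example.
dists <- apply(train_points, 1, euclidean, b = new_point)

# 1-NN prediction: the label of the single closest training example.
train_labels[which.min(dists)]
```

Because the distance calculation treats every attribute equally, attributes on larger scales would dominate the result, which is why the normalized data is required.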

The testing and training data will be slightly altered to run the kNN models. None of the values are changed, but the testing data is separated into a data frame that only contains the physiochemical property values and one that contains only the single column of wine quality ratings. The same is done for the training data.

#Separating the physiochemical property values from the labels for use in the kNN model.
test_set_num_only <- wineDistTest[,-12]
test_set_labels <- wineDistTest[,12]

train_set_labels <- wineDistTrain[,12]
train_set_num_only <- wineDistTrain[,-12]

When creating kNN models, the value of k must be chosen manually. If the value of k is too small, the model will be highly sensitive to any noise present in the training data. If the value of k is too big, the k nearest neighbors will likely include data points from other classes and confuse the model.

To find the optimal value of k for this data set, a model will be created for each value of k from 1 to 100. The accuracy of each model evaluated against the testing data will be saved in a vector and that vector will be used to create a plot of model accuracy vs. value of k. The value of k that produces the highest accuracy will be chosen as the final model and its confusion matrix will be displayed.

set.seed(12345)
#Initializing vectors
kNN_accuracies <- c()
max_accuracy <- 0
max_accuracy_k <- 0

#Running the for loop that will save the accuracy of each model as the value of k is changed. k will take every value from 1 to 100. The accuracies are saved in the empty kNN_accuracies vector. The maximum accuracy and the k value that corresponds to that maximum accuracy are saved as the variables, max_accuracy and max_accuracy_k.
for (i in 1:100) {
  wineKNN <- knn(train = train_set_num_only, test = test_set_num_only,
                   cl = train_set_labels$WineQuality, k = i, prob = TRUE)
  cm <- confusionMatrix(wineKNN, test_set_labels$WineQuality)
  accuracy_i <- cm$overall["Accuracy"]
  kNN_accuracies <- append(kNN_accuracies, accuracy_i)
  if (accuracy_i > max_accuracy) {
    max_accuracy <- accuracy_i
    max_accuracy_k <- i
  }
}

#Create a data frame where one column is the values of k (the numbers from 1 to 100) and the second column is the accuracy that corresponds with each value of k.
accuracy_df <- data.frame(c(1:100), kNN_accuracies)
accuracy_df <- accuracy_df %>%
  rename(k = c.1.100., accuracy = kNN_accuracies)

#Use the accuracy data frame to create a scatter plot. The intersection of the blue and orange lines will reveal the k value of the model with the highest accuracy.
accuracy_df %>%
  ggplot(aes(x = k, y = accuracy)) +
  geom_hline(yintercept = max_accuracy, color = "blue") +
  geom_vline(xintercept = max_accuracy_k, color = "orange") +
  geom_point() +
  scale_y_continuous(limits = c(0.5, 0.7), n.breaks = 20) +
  scale_x_continuous(n.breaks = 20) +
  ggtitle("Model Accuracy According to Value of K") +
  labs(y = "Accuracy", x = "Value of k") +
  theme_classic()

The kNN model with the highest accuracy has a k value of 1. The model performs best when decisions are made based on the single closest neighbor. The accuracy drops very sharply as soon as k increases past 1, falling by about 6 percentage points at k = 2. The accuracy decreases further from there, oscillating up and down with a slight overall downward trend as k approaches 100.

kNN Model Selection

The highest accuracy model using a k value of 1 will be re-run independently to examine its complete output.

set.seed(12345)
#Create the kNN model using k=1 and evaluate its accuracy against the test data.
wine_KNN1 <- knn(train = train_set_num_only, test = test_set_num_only,
                   cl = train_set_labels$WineQuality, k = 1, prob = TRUE)
confusionMatrix(wine_KNN1, test_set_labels$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   2   0   1   0   0
##          4   2  11  17   7   1   0   0
##          5   2  21 283 109  15   3   0
##          6   2  15 114 459  78  16   0
##          7   0   1  15  76 151  16   1
##          8   0   0   6   6  17  17   0
##          9   0   0   0   2   1   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6278          
##                  95% CI : (0.6025, 0.6526)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4435          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.229167   0.6476   0.6965   0.5720  0.32692
## Specificity          0.997947 0.980973   0.8544   0.7215   0.9094  0.97951
## Pos Pred Value       0.000000 0.289474   0.6536   0.6711   0.5808  0.36957
## Neg Pred Value       0.995902 0.974108   0.8511   0.7446   0.9064  0.97537
## Prevalence           0.004090 0.032720   0.2979   0.4492   0.1800  0.03545
## Detection Rate       0.000000 0.007498   0.1929   0.3129   0.1029  0.01159
## Detection Prevalence 0.002045 0.025903   0.2952   0.4663   0.1772  0.03136
## Balanced Accuracy    0.498973 0.605070   0.7510   0.7090   0.7407  0.65321
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          0.9979536
## Pos Pred Value       0.0000000
## Neg Pred Value       0.9993169
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0020450
## Balanced Accuracy    0.4989768

The model has an accuracy of 62.78% compared to the NIR of 44.92%. Like many of the other models, this one has difficulty differentiating wines rated as a 5 or a 6: 114 wines rated as a 5 were classified as having a rating of 6, and 109 wines rated as a 6 were classified as having a rating of 5. The model also has trouble identifying wines rated as an 8, with an almost equal number of those predictions being assigned to 6, 7, and 8. Even though the accuracy of the model is not great (it is only about 18 percentage points higher than guessing a quality rating of 6 every time), it is still a statistically significant improvement over the NIR model.

The following plots compare the distribution of the model’s predictions to the distribution of the actual data.

#Plotting the distribution of the model's predictions by digit.
plot(wine_KNN1, main = "Distribution of Predictions by Wine Quality Rating")

#Plotting the actual distribution of the test data by digit.
plot(test_set_labels, main = "Distribution of Wine Quality Rating in Test Data")

What is very interesting about this model is that the distribution of predictions is almost identical to the distribution of wine quality ratings in the test data. So even though it is wrong a little more than one-third of the time, the model still ends up with the correct proportion of the data that should be each rating.
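This kind of agreement can be checked by tabulating class proportions side by side with prop.table(). The sketch below uses invented labels purely for illustration; the analysis itself would pass wine_KNN1 and test_set_labels$WineQuality in their place.

```r
# Compare the proportion of each class in predictions vs. actual labels
# (made-up factors standing in for wine_KNN1 and the test labels).
preds  <- factor(c("5", "6", "6", "7", "5", "6"), levels = c("5", "6", "7"))
actual <- factor(c("5", "6", "7", "7", "6", "6"), levels = c("5", "6", "7"))

# Each row sums to 1; similar rows mean similar class distributions.
round(rbind(Predicted = prop.table(table(preds)),
            Actual    = prop.table(table(actual))), 3)
```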

Classification by Naive Bayes Models

Naive Bayes models classify examples by calculating probabilities based on Bayes’ theorem. It is a relatively simple model that assumes the attributes are independent of one another given the class, which greatly reduces the number of calculations necessary. During training, the model calculates and stores the probability of observing each attribute value given each class, along with the prior probability of each class. A posterior probability is then calculated for each new data example in the test data, and the model picks the class with the highest posterior probability as its prediction.
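The posterior calculation can be sketched with invented numbers. The class priors and per-attribute likelihoods below are made up for illustration, not estimated from the wine data; the point is only that the posterior is proportional to the prior times the product of the per-attribute likelihoods.

```r
# Hand-computed naive Bayes posterior for one hypothetical test example.
prior     <- c("5" = 0.30, "6" = 0.45, "7" = 0.25)  # P(class)
lik_attr1 <- c("5" = 0.20, "6" = 0.50, "7" = 0.30)  # P(attr1 value | class)
lik_attr2 <- c("5" = 0.60, "6" = 0.25, "7" = 0.15)  # P(attr2 value | class)

# Independence assumption: multiply the likelihoods, then normalize.
unnormalized <- prior * lik_attr1 * lik_attr2
posterior    <- unnormalized / sum(unnormalized)

names(which.max(posterior))  # the predicted class
```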

Naive Bayes often performs well even when the independence assumption is violated, making it a popular model for classification tasks. It is also resistant to inconsistent examples because each training example can incrementally increase or decrease the estimated probability that a hypothesis about one of the classes is correct. Sometimes there is a significant cost to compute all the probabilities, and it often requires initial knowledge of the probability estimates and additional assumptions may need to be made.

Naive Bayes runs into a problem when one of the conditional probabilities related to the attributes is zero: because the posterior is a product, the entire product becomes zero. This can be corrected by using Laplace smoothing, which replaces a zero probability with a very small nonzero probability.
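The arithmetic behind Laplace smoothing can be shown with made-up counts; the values below are hypothetical and only illustrate how adding the smoothing parameter to the numerator and denominator turns a zero estimate into a small nonzero one.

```r
# Laplace smoothing sketch for a categorical attribute (invented counts).
count_in_class <- 0   # times this attribute level occurred for the class
class_total    <- 20  # training examples belonging to the class
n_levels       <- 3   # distinct levels the attribute can take
laplace        <- 1   # smoothing parameter

unsmoothed <- count_in_class / class_total  # zero: would wipe out the product
smoothed   <- (count_in_class + laplace) / (class_total + laplace * n_levels)

c(unsmoothed = unsmoothed, smoothed = smoothed)
```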

The first step to selecting a model will be investigating how tuning the Laplace smoothing parameter affects the accuracy of the Naive Bayes models. A similar procedure to the one used for the kNN models will be run.

To find the optimal value of the Laplace smoothing parameter, a model will be created for each value of the parameter from 0 to 100. 0 is the default value for the parameter. The accuracy of each model evaluated against the testing data will be saved in a vector and that vector will be used to create a plot of model accuracy vs. value of the Laplace smoothing parameter. The value of the parameter that produces the highest accuracy will be chosen to be tuned further.

set.seed(12345)
#Initializing vectors
NB_accuracies <- c()
max_NB_accuracy <- 0
max_NB_accuracy_l <- 0

#Running the for loop that will save the accuracy of each model as the value of the Laplace smoothing parameter is changed. The parameter will take every value from 0 to 100. The accuracies are saved in the empty NB_accuracies vector. The maximum accuracy and the parameter value that corresponds to that maximum accuracy are saved as the variables max_NB_accuracy and max_NB_accuracy_l.
for (i in 0:100) {
  wine_NB <- naive_bayes(WineQuality ~., data = wineTrain, laplace = i)
  preds_NB <- predict(wine_NB, wineTest)
  NBcm <- confusionMatrix(preds_NB, wineTest$WineQuality)
  accuracy_i <- NBcm$overall["Accuracy"]
  NB_accuracies <- append(NB_accuracies, accuracy_i)
  if (accuracy_i > max_NB_accuracy) {
    max_NB_accuracy <- accuracy_i
    max_NB_accuracy_l <- i
  }
}

#Create a data frame where one column is the values of the Laplace smoothing parameter (the numbers from 0 to 100) and the second column is the accuracy that corresponds with each value of the parameter.
NB_accuracy_df <- data.frame(c(0:100), NB_accuracies)
NB_accuracy_df <- NB_accuracy_df %>%
  rename(l = c.0.100., accuracy = NB_accuracies)

#Use the accuracy data frame to create a scatter plot. The intersection of the blue and orange lines will reveal the Laplace smoothing parameter value of the model with the highest accuracy.
NB_accuracy_df %>%
  ggplot(aes(x = l, y = accuracy)) +
  geom_hline(yintercept = max_NB_accuracy, color = "blue") +
  geom_vline(xintercept = max_NB_accuracy_l, color = "orange") +
  geom_point() +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(n.breaks = 20) +
  ggtitle("Model Accuracy by Value of Laplace Smoothing Parameter") +
  labs(y = "Accuracy", x = "Value of Laplace Smoothing Parameter") +
  theme_classic()

The scatter plot reveals that the Laplace smoothing parameter has no effect on the accuracy of the model. The accuracy is also very poor across the board, sitting very close to, if not at, the NIR.

The lack of change in accuracy as the Laplace smoothing parameter varies does make sense: the parameter only applies to categorical attributes, where a level that never occurs for a class produces a zero count. All of the predictors in this data set are continuous, so no zero probabilities arise from the data.

Since the value of the Laplace smoothing parameter does not affect the accuracy, the model using the default value of 0 will be re-run and its output shown.

set.seed(12345)
#Creating the model with Laplace smoothing parameter equal to 0, then evaluating its accuracy using the test data.
wine_NB1 <- naive_bayes(WineQuality ~., data = wineTrain, laplace = 0)
preds_NB1 <- predict(wine_NB1, wineTest)
confusionMatrix(preds_NB1, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   3   1   8   5   0   0   0
##          4   1  13  16  14   0   0   0
##          5   0  17 246 163  24   5   0
##          6   1  11 115 218  61  10   0
##          7   1   6  52 257 173  36   1
##          8   0   0   0   2   6   1   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4458          
##                  95% CI : (0.4202, 0.4717)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 0.6133          
##                                           
##                   Kappa : 0.2243          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7  Class: 8
## Sensitivity          0.500000 0.270833   0.5629   0.3308   0.6553 0.0192308
## Specificity          0.990418 0.978154   0.7971   0.7550   0.7066 0.9943463
## Pos Pred Value       0.176471 0.295455   0.5407   0.5240   0.3289 0.1111111
## Neg Pred Value       0.997931 0.975404   0.8113   0.5804   0.9033 0.9650206
## Prevalence           0.004090 0.032720   0.2979   0.4492   0.1800 0.0354465
## Detection Rate       0.002045 0.008862   0.1677   0.1486   0.1179 0.0006817
## Detection Prevalence 0.011588 0.029993   0.3102   0.2836   0.3586 0.0061350
## Balanced Accuracy    0.745209 0.624493   0.6800   0.5429   0.6809 0.5067885
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

As the scatter plot had already revealed, the accuracy of this model is 44.58% which is extremely similar to the NIR of 44.92%. Based on the p-value of 0.6133, the naive Bayes model cannot be said to perform differently from the NIR model.

To see if the accuracy of the model changes, the usekernel parameter will be set to TRUE. This makes the model estimate the class conditional densities of the predictors with kernel density estimation instead of assuming they are normally distributed.

set.seed(12345)
#Create the Naive Bayes model and evaluate its accuracy using the test data.
wine_NB2 <- naive_bayes(WineQuality ~., data = wineTrain, laplace = 0,
                          usekernel = TRUE)
preds_NB2 <- predict(wine_NB2, wineTest)
confusionMatrix(preds_NB2, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   1   2   0   0   0
##          4   1  10  11   4   0   0   0
##          5   2  22 267 172  27   4   0
##          6   2  12 128 311  86  16   0
##          7   1   2  30 166 147  30   1
##          8   0   2   0   4   4   2   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5024          
##                  95% CI : (0.4765, 0.5283)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 2.478e-05       
##                                           
##                   Kappa : 0.271           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.208333   0.6110   0.4719   0.5568 0.038462
## Specificity          0.997947 0.988724   0.7796   0.6980   0.8088 0.992933
## Pos Pred Value       0.000000 0.384615   0.5405   0.5604   0.3899 0.166667
## Neg Pred Value       0.995902 0.973629   0.8253   0.6184   0.8927 0.965636
## Prevalence           0.004090 0.032720   0.2979   0.4492   0.1800 0.035446
## Detection Rate       0.000000 0.006817   0.1820   0.2120   0.1002 0.001363
## Detection Prevalence 0.002045 0.017723   0.3367   0.3783   0.2570 0.008180
## Balanced Accuracy    0.498973 0.598529   0.6953   0.5850   0.6828 0.515697
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Tuning this parameter did result in an improvement to the accuracy of the model. The accuracy is now 50.24% and the p-value demonstrates that this model does perform better than the NIR model.

One more naive Bayes model will be created, this time setting the usepoisson parameter to TRUE. With this setting, integer-valued predictors are modeled with a Poisson distribution instead of a normal distribution.

set.seed(12345)
#Create the Naive Bayes model and evaluate its accuracy using the test data.
wine_NB3 <- naive_bayes(WineQuality ~., data = wineTrain, laplace = 0,
                          usekernel = TRUE, usepoisson = TRUE)
preds_NB3 <- predict(wine_NB3, wineTest)
confusionMatrix(preds_NB3, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   1   2   0   0   0
##          4   1  10  11   4   0   0   0
##          5   2  22 267 172  27   4   0
##          6   2  12 128 311  86  16   0
##          7   1   2  30 166 147  30   1
##          8   0   2   0   4   4   2   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5024          
##                  95% CI : (0.4765, 0.5283)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 2.478e-05       
##                                           
##                   Kappa : 0.271           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.208333   0.6110   0.4719   0.5568 0.038462
## Specificity          0.997947 0.988724   0.7796   0.6980   0.8088 0.992933
## Pos Pred Value       0.000000 0.384615   0.5405   0.5604   0.3899 0.166667
## Neg Pred Value       0.995902 0.973629   0.8253   0.6184   0.8927 0.965636
## Prevalence           0.004090 0.032720   0.2979   0.4492   0.1800 0.035446
## Detection Rate       0.000000 0.006817   0.1820   0.2120   0.1002 0.001363
## Detection Prevalence 0.002045 0.017723   0.3367   0.3783   0.2570 0.008180
## Balanced Accuracy    0.498973 0.598529   0.6953   0.5850   0.6828 0.515697
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Changing the usepoisson parameter to TRUE had no effect on the accuracy of the model. The accuracy is again 50.24%, and the confusion matrix reveals that this model made exactly the same predictions as the previous model.

Naive Bayes Model Comparison and Selection

Overall, the naive Bayes models performed very poorly. Adjusting the Laplace smoothing parameter had no effect on the accuracy, which remained at 44.58%, lower than the NIR of 44.92%. This means the model was so inaccurate that a better classifier would be one that simply assigns every wine a rating of 6 without ever looking at the training data. Fortunately, setting the usekernel parameter to TRUE moved the accuracy above the NIR, though only to 50.24%. Setting the usepoisson parameter to TRUE had no effect on the model. These models were able to predict quality ratings of 3 through 8, but they did so very poorly. All of the models classified wines that were actually rated a 6 or a 7 wrong more often than right.

The second model, with the usekernel parameter set to TRUE, will be chosen as the final model. Even though this model is being chosen as the representative of the naive Bayes technique, it is important to note that these models are not suited for this classification task.

Results

The final model of each model type will be featured in its own results section. These sections will discuss the strengths and weaknesses of the model, as well as highlight any insights that were provided by the model. Then the results of the two clustering models, followed by the results of classification models, will be compared and discussed. The model that had the best accuracy and came the closest to solving the problem of predicting the wine quality ratings based on these physiochemical properties will be chosen. Finally, there will be a section describing the issues and limitations related to both the data set and models used, which will also suggest improvements that could be applied to future analyses.

Association Rule Mining Results

The association rules were able to uncover interesting patterns and relationships in the data set. The main purpose of this analysis, however, was to see if the association rules generated while targeting right hand sides with specific quality ratings could be summarized into a general description of what makes a wine poor, average, or good. The association rules were relatively successful in helping to generate these descriptions, although the strength of the results varied depending on which wine quality ratings were being targeted.

The first group of association rules created were those where the right hand side was a wine quality rating of 7, 8, or 9. These rules showed the strongest similarities to one another out of the three groups. Based on the itemsets, a relatively strong description of the properties of the good wines can be created: a good wine generally results from low volatile acidity, low pH, low alcohol percentage, and a medium amount of citric acid. The descriptors “low” and “medium” are in reference to the complete distribution of each variable across the entire data set. While each of these rules had low support, meaning they did not occur often, they all had 100% confidence, so whenever the left hand side of the rule occurred, the right hand side was always an item representing a good wine quality rating.

The second group of association rules created were those where the right hand side was a wine quality rating of 5 or 6. These association rules had the fewest similarities, and the left hand side itemsets varied considerably from one rule to another. The only item that showed some consistency throughout the rule set was alcohol percentage: an alcohol percentage between 9.2% and 9.5% often appeared on the left hand side of rules where the right hand side was a wine quality rating of 5. Due to the variation in the rule set, a concise description of what makes an average quality wine could not be created. This may be because average wines have some good qualities and some bad qualities, and which qualities are which will vary from wine to wine. Essentially, the average wines are somewhat of a random mix of the qualities that lead to good or poor wines.

The third group of association rules created were those where the right hand side was a wine quality rating of 3 or 4. Due to the lack of wine samples rated as a 3, only quality ratings of 4 showed up in the right hand side of the item set. While this set of rules was still rather varied, there were slightly more similarities between the rules, enough to create a short general description of what makes a poor wine: the lowest levels of citric acid and free SO2 generally result in a poor quality wine. Considering the role of free SO2 in the wine-making process, levels that are too low could lead to the wine spoiling faster or containing unwanted microbes and bacteria. This is a possible explanation for why a wine would be rated poorly.

The association rule mining analysis was rather successful overall, and it gave insight into what properties to look out for when conducting the clustering and classification analyses.

K-Means Clustering Results

The result of the final k-Means clustering model will be reproduced below for ease of discussion. The final model chosen was the one featuring seven clusters.

wineKM7Bar %>%
  ggplot(aes(x = WineQuality, fill = Clusters)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15)) 

The goal of the k-Means clustering analysis was to see if the model would recreate the same groupings as when the data is grouped by wine quality rating. If it were able to, this would give insight into the ability of the values of the properties to consistently correlate with the quality of the wine samples. To aid in the accuracy of the clustering analysis, a normalized version of the data set was used, since the k-Means clustering algorithm forms the clusters based on Euclidean distance calculations.

Unfortunately, none of the k-Means models came even close to reproducing any of the wine quality groups. The seven cluster model assigned wine samples from every wine quality group into all the different clusters. This result showed that either the model type or the data was insufficient to group “similar” wines, based on the values of the input variables, into the same groups as their quality ratings.

Hierarchical Agglomerative Clustering Results

The result of the final HAC model will be reproduced below for ease of discussion. The final model chosen was the one that used the Minkowski distance measure.

plot(dendrogram_color4)

wineClusterResult4 %>%
  ggplot(aes(x = WineQuality, fill = cluster)) +
  geom_bar(stat="count") +
  labs(title = "Cluster Assignments by Wine Quality") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 15))

The goal of the HAC analysis was the same as the k-Means: to see if the clusters would recreate the wine quality rating groups. Since this model type also uses distance calculations, the normalized data was once again used.
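The Minkowski-distance HAC pipeline can be sketched in base R as follows. The toy data frame and the choice of the Minkowski exponent p = 3 are illustrative assumptions, not the project's exact settings.

```r
#Illustrative sketch only: synthetic normalized data standing in for the wine data.
set.seed(12345)
toy <- data.frame(a = runif(40), b = runif(40))
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
toyNorm <- as.data.frame(lapply(toy, normalize))

#dist() supports the Minkowski distance directly; p = 2 would reduce it to Euclidean.
d <- dist(toyNorm, method = "minkowski", p = 3)

#Agglomerative clustering, then cut the dendrogram to get cluster labels that
#can be compared against the wine quality groups.
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, k = 7)
table(clusters)
```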

Unfortunately, the HAC models were not able to reproduce the wine quality rating groups either. Two of the clusters were very large and the rest were very small, and the wine samples of various quality ratings were still spread across all the different groups. Based on the output, there is no way to say that one cluster represents a single quality rating, or even a grouping of quality ratings like the ones used in the association rule mining analysis. Once again, the results showed that either the HAC method or the data was insufficient to create clusters that matched the wine quality rating groupings.

Decision Tree Classification Results

The final decision tree model chosen was the one with the complexity parameter set to 0.003. This tree had 14 splits, and its terminal nodes only featured wine quality ratings of 5, 6, or 7. Even with this major limitation, it still had the second-best accuracy of the decision tree models at 52.76%. The only decision tree with better accuracy was the completely unpruned tree with over 800 splits; because that model was uninterpretable and could not be visualized, it was not selected as the final model.

The decision tree and confusion matrix will be reproduced below. Normalization is not required when building decision trees, so the original data set was used.

fancyRpartPlot(DT_model2)

confusionMatrix(preds_DT2, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   1  24 228 124  14   1   0
##          6   4  23 208 497 201  37   0
##          7   1   1   1  38  49  14   1
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5276          
##                  95% CI : (0.5017, 0.5534)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 1.044e-09       
##                                           
##                   Kappa : 0.2262          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.5217   0.7542  0.18561  0.00000
## Specificity           1.00000  1.00000   0.8408   0.4146  0.95345  1.00000
## Pos Pred Value            NaN      NaN   0.5816   0.5124  0.46667      NaN
## Neg Pred Value        0.99591  0.96728   0.8056   0.6740  0.84214  0.96455
## Prevalence            0.00409  0.03272   0.2979   0.4492  0.17996  0.03545
## Detection Rate        0.00000  0.00000   0.1554   0.3388  0.03340  0.00000
## Detection Prevalence  0.00000  0.00000   0.2672   0.6612  0.07157  0.00000
## Balanced Accuracy     0.50000  0.50000   0.6813   0.5844  0.56953  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Even though this model was chosen as the final representative model for the decision trees, it does not do a good job of correctly predicting the wine quality rating based on the values of the input variables. This model only classified about 50% of the wines with a quality rating of 5 correctly. It also incorrectly classified the great majority of wines with a quality rating of 7. The model was only successful when classifying wines with a quality rating of 6, and even then it made over 150 mistakes. Once again, this analysis leads to the conclusion that either this type of classification model is not well-suited for the task or the data does not provide enough information to classify the wines by their quality ratings.

Calling the varImp (variable importance) function on the decision tree model reveals the variables that were most significant in determining the tree's splits.

#Display the variable importance in descending order of overall importance.
imp <- as.data.frame(varImp(DT_model2))
imp <- data.frame(overall = imp$Overall,
                  names   = rownames(imp))
imp[order(imp$overall,decreasing = TRUE),]
##      overall           names
## 6  228.41418      PercentAlc
## 11 156.58388 VolatileAcidity
## 3  117.17753         Density
## 1  104.63573       Chlorides
## 5   76.21961         FreeSO2
## 10  74.67564        TotalSO2
## 7   56.37084              pH
## 4   34.73450    FixedAcidity
## 2   18.46877      CitricAcid
## 8   12.51760      ResidSugar
## 9   11.07830        Sulfates

PercentAlc, VolatileAcidity, Density, and Chlorides were the most important variables for this type of model. This provides important information for future analyses. A threshold for importance could be chosen and those variables that exceed the threshold would be retained in a new data set. Those variables that fall below the threshold could be replaced by other properties that may be more helpful in classifying the data according to wine quality rating.
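A minimal sketch of that thresholding idea, using the top importance values from the table above; the cutoff of 100 is a hypothetical choice, not one made in the analysis.

```r
#Illustrative sketch: filter variables by an importance threshold.
#These values are copied from the varImp output above.
imp <- data.frame(overall = c(228.41, 156.58, 117.18, 104.64, 76.22, 74.68),
                  names   = c("PercentAlc", "VolatileAcidity", "Density",
                              "Chlorides", "FreeSO2", "TotalSO2"))

#Keep only the variables whose overall importance exceeds the chosen threshold.
keep <- imp$names[imp$overall > 100]
keep
## [1] "PercentAlc"      "VolatileAcidity" "Density"         "Chlorides"
```

A new data set restricted to the `keep` columns (plus WineQuality) could then be fed back into the modeling pipeline.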

Random Forest Classification Results

The final random forest model chosen was the one that grew 300 decision trees to create its model. The confusion matrix for this model will be reproduced below.

confusionMatrix(predsRF1, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   5   1   1   0   0   0
##          5   2  26 303  78   8   0   0
##          6   4  17 131 533 111  21   0
##          7   0   0   2  46 145  17   1
##          8   0   0   0   1   0  14   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6817          
##                  95% CI : (0.6571, 0.7055)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5016          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.104167   0.6934   0.8088  0.54924 0.269231
## Specificity           1.00000 0.998591   0.8893   0.6485  0.94514 0.999293
## Pos Pred Value            NaN 0.714286   0.7266   0.6524  0.68720 0.933333
## Neg Pred Value        0.99591 0.970548   0.8724   0.8062  0.90525 0.973829
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.003408   0.2065   0.3633  0.09884 0.009543
## Detection Prevalence  0.00000 0.004772   0.2843   0.5569  0.14383 0.010225
## Balanced Accuracy     0.50000 0.551379   0.7913   0.7287  0.74719 0.634262
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

While 68% accuracy is not great, the random forest model performed very well compared to the other types of models. This model was able to correctly classify some of the wines with quality ratings of 4 and 8. It also correctly predicted more than two-thirds of the wines rated as a 5 and more than half of the wines rated as a 7. These results are notable because those were the classes that many of the other models had the most trouble with. It appears that a random forest model may be the best suited for this classification task. Still, with accuracy below 70%, the results suggest that there could be an issue with the data: the values of the variables in the data set do not seem to separate the wines into the same classes as the wine quality ratings.

Support Vector Machine Classification Results

The final support vector machine model chosen was the one that used the radial kernel with a cost parameter equal to 50. The confusion matrix for this model will be reproduced below.

confusionMatrix(preds_SVM_R50, wineDistTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   1   0
##          4   0  13  14   7   0   0   0
##          5   5  23 271 112   8   2   0
##          6   1  11 139 443 112  14   0
##          7   0   1  12  88 135  23   1
##          8   0   0   1   7   9  12   0
##          9   0   0   0   2   0   0   0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5958         
##                  95% CI : (0.5701, 0.621)
##     No Information Rate : 0.4492         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3882         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                       Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.0000000 0.270833   0.6201   0.6722  0.51136  0.23077
## Specificity          0.9993155 0.985201   0.8544   0.6572  0.89609  0.98799
## Pos Pred Value       0.0000000 0.382353   0.6437   0.6153  0.51923  0.41379
## Neg Pred Value       0.9959072 0.975576   0.8413   0.7108  0.89312  0.97218
## Prevalence           0.0040900 0.032720   0.2979   0.4492  0.17996  0.03545
## Detection Rate       0.0000000 0.008862   0.1847   0.3020  0.09202  0.00818
## Detection Prevalence 0.0006817 0.023177   0.2870   0.4908  0.17723  0.01977
## Balanced Accuracy    0.4996578 0.628017   0.7373   0.6647  0.70373  0.60938
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          0.9986357
## Pos Pred Value       0.0000000
## Neg Pred Value       0.9993174
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0013633
## Balanced Accuracy    0.4993179

This model was chosen because its accuracy of 59.58% was the highest of all the SVM models. It was also able to correctly classify some of the wine samples rated as a 4 or an 8. Similar to the other types of models, it had difficulty classifying wines rated as a 5 or a 7. It performed best when classifying wines rated as a 6, but even then it made over 200 mistakes.

All the SVM models produced underwhelming results, and they led to the same conclusion as many of the previous model types: either SVM models are not suited for this classification task, or there is an underlying issue with the data that prevents successful classification according to the wine quality ratings.

K Nearest Neighbors Classification Results

The final k Nearest Neighbors model chosen was the one where k was equal to 1. All of the other models, created using values of k up to 100, had a significant drop in accuracy compared to the k = 1 model. The confusion matrix for this model will be reproduced below. Since this algorithm uses distance calculations, the normalized training and testing data sets were used for the analysis.

confusionMatrix(wine_KNN1, test_set_labels$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   2   0   1   0   0
##          4   2  11  17   7   1   0   0
##          5   2  21 283 109  15   3   0
##          6   2  15 114 459  78  16   0
##          7   0   1  15  76 151  16   1
##          8   0   0   6   6  17  17   0
##          9   0   0   0   2   1   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6278          
##                  95% CI : (0.6025, 0.6526)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4435          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.229167   0.6476   0.6965   0.5720  0.32692
## Specificity          0.997947 0.980973   0.8544   0.7215   0.9094  0.97951
## Pos Pred Value       0.000000 0.289474   0.6536   0.6711   0.5808  0.36957
## Neg Pred Value       0.995902 0.974108   0.8511   0.7446   0.9064  0.97537
## Prevalence           0.004090 0.032720   0.2979   0.4492   0.1800  0.03545
## Detection Rate       0.000000 0.007498   0.1929   0.3129   0.1029  0.01159
## Detection Prevalence 0.002045 0.025903   0.2952   0.4663   0.1772  0.03136
## Balanced Accuracy    0.498973 0.605070   0.7510   0.7090   0.7407  0.65321
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          0.9979536
## Pos Pred Value       0.0000000
## Neg Pred Value       0.9993169
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0020450
## Balanced Accuracy    0.4989768

This model was one of the few where at least one prediction corresponded with every available quality rating. Even so, it had significant shortcomings, with an accuracy of only 62.78%. The k nearest neighbors model classified the majority of wines rated a 5, 6, or 7 correctly. It was also able to classify some of the wines rated a 4 or 8 correctly, although with rather poor accuracy. This was one of the more successful models in the entire analysis; since so many methods were used and 62.78% is among the best accuracies, the results again imply that there is an issue with the data set that is preventing more successful classifications.

Naive Bayes Classification Results

The final naive Bayes model chosen was the one with a Laplace smoothing parameter of 0 and the usepoisson parameter set to TRUE. The confusion matrix for this model is reproduced below.

confusionMatrix(preds_NB2, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   1   2   0   0   0
##          4   1  10  11   4   0   0   0
##          5   2  22 267 172  27   4   0
##          6   2  12 128 311  86  16   0
##          7   1   2  30 166 147  30   1
##          8   0   2   0   4   4   2   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5024          
##                  95% CI : (0.4765, 0.5283)
##     No Information Rate : 0.4492          
##     P-Value [Acc > NIR] : 2.478e-05       
##                                           
##                   Kappa : 0.271           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.208333   0.6110   0.4719   0.5568 0.038462
## Specificity          0.997947 0.988724   0.7796   0.6980   0.8088 0.992933
## Pos Pred Value       0.000000 0.384615   0.5405   0.5604   0.3899 0.166667
## Neg Pred Value       0.995902 0.973629   0.8253   0.6184   0.8927 0.965636
## Prevalence           0.004090 0.032720   0.2979   0.4492   0.1800 0.035446
## Detection Rate       0.000000 0.006817   0.1820   0.2120   0.1002 0.001363
## Detection Prevalence 0.002045 0.017723   0.3367   0.3783   0.2570 0.008180
## Balanced Accuracy    0.498973 0.598529   0.6953   0.5850   0.6828 0.515697
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

The naive Bayes models performed particularly poorly. Regardless of the value of the Laplace smoothing parameter, all the models that did not have the usepoisson parameter set to TRUE had worse accuracy than the No Information Rate. The final naive Bayes model was not much better, with an overall accuracy of 50.24%. The confusion matrix reveals that this model actually classified wines with a rating of 6 incorrectly more often than correctly, which is fairly unique behavior compared to the other types of models. Even though its accuracy on wines rated as a 6 was lower, the model correctly classified the majority of wines rated as a 5 or 7 and was able to predict every wine quality rating except a 9.

Once again, these results imply that either this model type is not well-suited for a classification task involving this data or that it is the data that is causing all the accuracy issues.

Comparison of Clustering Models Results

The k-Means and HAC models returned very similar results. The models were unable to create clusters that matched the wine quality rating groups. In fact, the models mixed all of the different wine quality rating groups into as many different clusters as they could, which resulted in some clusters containing wine samples from almost all of the wine quality rating groups. While these models did not provide strong results toward the goal of using the physiochemical properties to predict the quality of the wine, they did begin to highlight that the data set might not be well suited for making those predictions.

Comparison of Classification Models Results

The following table summarizes the accuracies of each of the final models chosen from the different types of supervised machine learning techniques used in the analysis.

model_names <- c("Random Forest Model with 300 Trees", "k Nearest Neighbors with k = 1", "SVM with Radial Kernel and Cost = 50", "Decision Tree with Complexity Parameter = 0.003", "Naive Bayes with Laplace = 0 and usepoisson = TRUE")
model_accuracy <- c("68.17%", "62.78%", "59.58%", "52.76%", "50.24%")
model_table <- data.frame(model_names, model_accuracy)
kable(model_table, col.names = c("Final Model", "Accuracy"), caption = "Accuracy Level of Each Final Model")
Accuracy Level of Each Final Model
Final Model Accuracy
Random Forest Model with 300 Trees 68.17%
k Nearest Neighbors with k = 1 62.78%
SVM with Radial Kernel and Cost = 50 59.58%
Decision Tree with Complexity Parameter = 0.003 52.76%
Naive Bayes with Laplace = 0 and usepoisson = TRUE 50.24%

The only two model types that achieved greater than 60% accuracy were the kNN and random forest models, and even these had significant issues: both had difficulty predicting the lowest and highest wine quality ratings. The decision tree had the worst accuracy overall when classifying wine samples rated a 5, 6, or 7. The naive Bayes model was more successful at classifying wines rated a 5 or 7, but this was offset by having the worst accuracy of all the models when classifying wines rated a 6. The SVM, kNN, and random forest models all had better accuracy when classifying wines rated a 5, 6, or 7, and issues classifying any other quality ratings. The differences in the accuracies of these models came down to how successful they were at classifying the wines rated a 5, 6, or 7.

Taking all of these results into account, the overall conclusion that can be made is that the data set may not be robust enough to support this classification task. If it were, at least one of the model types should have been more successful. The physiochemical properties in the data set may not be the ones that impact the quality rating the most or there could be a problem with the ratings themselves.

Final Model Selection

One of the final models will be selected as the one that made the best effort toward solving the problem of predicting the wine quality ratings from the physiochemical properties provided by the data set. That model is the random forest with 300 trees. It had the best accuracy by far, and was therefore able to make the most correct classifications across all the different wine quality ratings.

In one last effort to see if this model type can make more accurate predictions, it will be run with a reduced number of features. The correlation analysis revealed correlations between Density and PercentAlc, between Density and ResidSugar, and between FreeSO2 and TotalSO2. Correlation among the input variables causes the biggest issues when running linear regressions, but will removing the less-used variable from each correlated pair make a difference to the random forest's overall accuracy?

Density was used in many more nodes than ResidSugar, and FreeSO2 was used in slightly more nodes than TotalSO2. Thus, ResidSugar and TotalSO2 will be removed as input variables and one last random forest model will be created. Density and PercentAlc were both among the most heavily used variables, so both were retained despite their correlation.

#Final random forest model with feature selection to see if accuracy improves.
set.seed(12345)
rf_feature <- randomForest(WineQuality ~ FixedAcidity + VolatileAcidity + CitricAcid +
                             Chlorides + FreeSO2 + Density + pH + Sulfates + PercentAlc,
                           data = wineTrain, ntree = 300, importance = TRUE, proximity = TRUE)
predsRF_feature <- predict(rf_feature, newdata = wineTest, type = "class")
confusionMatrix(predsRF_feature, wineTest$WineQuality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   6   3   1   0   0   0
##          5   1  26 294  82   8   0   0
##          6   5  16 137 524 111  18   0
##          7   0   0   3  51 143  20   1
##          8   0   0   0   1   2  14   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6687         
##                  95% CI : (0.644, 0.6928)
##     No Information Rate : 0.4492         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.483          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.125000   0.6728   0.7951  0.54167 0.269231
## Specificity           1.00000 0.997181   0.8864   0.6448  0.93766 0.997880
## Pos Pred Value            NaN 0.600000   0.7153   0.6461  0.65596 0.823529
## Neg Pred Value        0.99591 0.971174   0.8646   0.7942  0.90312 0.973793
## Prevalence            0.00409 0.032720   0.2979   0.4492  0.17996 0.035446
## Detection Rate        0.00000 0.004090   0.2004   0.3572  0.09748 0.009543
## Detection Prevalence  0.00000 0.006817   0.2802   0.5528  0.14860 0.011588
## Balanced Accuracy     0.50000 0.561091   0.7796   0.7200  0.73966 0.633555
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993183
## Prevalence           0.0006817
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

Using the reduced set of input variables slightly hurt the accuracy, dropping it from 68.17% to 66.87%. Removing some of the correlated variables did not improve the model.

The best model is therefore the 300-tree random forest using all the input variables. Even though this is the best model, it still has significant problems: it has a very difficult time classifying several of the wine quality ratings. Still, if an additional data set of physiochemical properties were provided before the wines were judged, this model's predictions would come closest to the forthcoming ratings from the wine experts.

Issues, Limitations, and Future Improvements

Based on the results of all the models, there are clearly some factors holding back their accuracies. The first possible explanation is that the data set is not appropriate for making these classifications. It could be that the judges focus on properties of the wine that do not originate from the values of the physiochemical properties in the data set. While many of these values do influence the taste of the wine, the flavor may be more strongly influenced by other aspects.

The second possible explanation is that the way in which the quality ratings were assigned to the wines makes this an especially difficult output to predict. The wines are tasted by at least three judges, and the final rating is the median of all the individual ratings. So, as an extreme but possible example, one judge could love the wine and rate it a 10. The next judge could hate the wine and rate it a 2. The third judge doesn't mind the wine, but isn't impressed either, and rates it a 6. This very divisive wine ends up with a final rating of 6, but everything that went into that final rating is essentially hidden from the data set. This scenario could also explain why the median values of the distributions of many of the physiochemical properties were so similar across the different wine quality ratings.
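The arithmetic of that example is easy to verify: the median throws away everything about how far apart the three scores were.

```r
#Three very different judge scores collapse to a single middling rating.
scores <- c(10, 2, 6)
median(scores)
## [1] 6
```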

These two explanations can be combined to conclude that maybe it is just really difficult to predict how the wines will be judged and that is why the accuracy was low. These judges are all real people with different likes and dislikes, all bringing unique experiences to their judging. Maybe a judge had a really bad day and this will cause them to rate the wines lower than they normally would. Maybe a judge just really prefers sweet wines to dry wines and will rate the dry wines lower regardless of how good they are for a dry wine. There are so many biases and so much subjectivity that can play into this rating process, that it is possible that a model with 68.17% accuracy is actually producing a phenomenal result based on the inherent difficulty of the task.

Besides these issues, the data set had an obvious limitation. Since the final ratings are medians of individual scores, they will tend toward the values at the center of the distribution, so the quality ratings pile up around the middle of the scale, an effect akin to the central limit theorem at work. This means that no matter how much data is collected, there will be far more 5s, 6s, and 7s than any other rating unless some wine samples are purposely excluded. Perhaps a custom data set could be made that, instead of including all the samples, contains approximately 50 or 100 wine samples in each quality rating category. This more balanced data set could produce better results, even with fewer overall examples to train and test the models on.
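That balanced-sampling idea can be sketched in base R as follows. The wineData data frame, its imbalanced rating distribution, and the per-class cap of 50 are hypothetical stand-ins for the real data.

```r
#Illustrative sketch: synthetic ratings with the middle-heavy shape described above.
set.seed(12345)
wineData <- data.frame(WineQuality = sample(3:9, 500, replace = TRUE,
                                            prob = c(1, 5, 30, 45, 15, 3, 1)))

#Sample at most 50 wines from each quality rating to build a balanced data set.
balanced <- do.call(rbind, lapply(split(wineData, wineData$WineQuality),
                                  function(g) g[sample(nrow(g), min(50, nrow(g))), ,
                                                drop = FALSE]))
table(balanced$WineQuality)  #no rating contributes more than 50 samples
```

In practice the sparse classes (3, 4, 8, and 9) would contribute fewer than 50 samples each, so the cap mainly trims the overrepresented 5s, 6s, and 7s.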

Another possibility for future investigations could be to create data sets based on the individual scores of each wine expert before the median is taken. That way separate models could be created that could predict the scores each individual judge is most likely to give to a new wine sample. Since the scores are not being manipulated at all, there is a good chance that a model would be able to more closely replicate the choices of a single judge, instead of trying to build a model that can replicate the combined choices of multiple judges.

Unfortunately, this particular data set was limited by copyright issues, so it would not be possible to get any additional data on these exact wine samples. However, this does present a fun opportunity to conduct one's own research by hosting a new round of wine sample taste tests.

Conclusion

Wine has been around for thousands of years, almost as long as civilizations themselves have existed. Over all that time, the process for making wine, as well as the ingredients used, have changed. Each change that becomes a new standard stems from it making a positive impact on the taste of the wine. With many thousands of years of wine-making, a process similar to natural selection takes place: the better quality wines continue to be created and improved upon, while the worse quality wines are left behind. Creating better and better wines requires people who understand which parts of the process most strongly correlate with great taste, whether that is a wine-making step, like how long the wine is aged in a barrel, or how much of a physiochemical component is present in the wine. Since humans are able to do this, can a computer learn to do it as well?

Based on the data available for this analysis, the models were not able to learn the decision-making process the judges used when tasting the wine. That resulted in low accuracy when trying to predict what quality rating a wine sample would receive given the values of its physiochemical properties. It could be that these properties are not the ones that impact the taste of the wine the most, that the taste is more strongly influenced by other parts of the wine-making process, or that this is simply an extremely difficult task given the subjective nature of scoring wines.

Even though the results were not very positive, they still provide a lot of insight toward predicting wine quality ratings in the future. Other data could be used, or entirely new surveys could be conducted. Unfortunately, though, there is no guarantee that the accuracy of the predictions would improve in new studies. This may still be a task that only a human can complete: directly tasting the wine is a much different experience than a computer processing numbers that “describe” the taste of the wine. For now, any budding wine connoisseurs should continue to rely on the ratings of expert human judges to help them pick out their next favorite drink.